PubNet: a flexible system for visualizing literature derived networks
© Douglas et al.; licensee BioMed Central Ltd. 2005
Received: 4 April 2005
Accepted: 12 July 2005
Published: 16 August 2005
We have developed PubNet, a web-based tool that extracts several types of relationships returned by PubMed queries and maps them into networks, allowing for graphical visualization, textual navigation, and topological analysis. PubNet supports the creation of complex networks derived from the contents of individual citations, such as genes, proteins, Protein Data Bank (PDB) IDs, Medical Subject Headings (MeSH) terms, and authors. This feature allows one to, for example, examine a literature derived network of genes based on functional similarity.
The amount of widely accessible scientific data has increased dramatically in recent years. There are currently more than 31,000 structures in the Protein Data Bank (PDB), as compared with 3,000 structures 10 years ago. Swiss-Prot now contains more than 178,000 sequence entries, which is up from 40,000 in 1994. With continual advances and refinements of experimental and computational technologies, data creation promises to accelerate for the foreseeable future.
PubMed stands out as a key information resource in the biological sciences in terms of diversity, breadth, and manual curation. PubMed entries comprise an order of magnitude more data than the three billion bases of the human genome. In addition to basic citation and abstract information, PubMed provides rich meta-information including Medical Subject Headings (MeSH) terms, detailed affiliation, and any secondary source databanks and accession numbers of molecules discussed in each article. By parsing the XML output of a query and performing a few simple operations, it is possible to uncover many interesting relationships among publications.
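As an illustration of the kind of parsing described above, the following minimal sketch extracts the PubMed ID, author names, and MeSH descriptors from a small XML fragment. The fragment is a hypothetical, heavily trimmed record written in the style of PubMed XML output, and the function name `parse_citations` is an assumption made for this example; it is not taken from the PubNet source code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record in the style of PubMed XML output.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
          <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Proteins</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Crystallography, X-Ray</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_citations(xml_text):
    """Return one dict per citation with its PMID, authors, and MeSH terms."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        authors = []
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            if last:  # skip entries without a personal name
                initials = author.findtext("Initials", default="")
                authors.append(f"{last} {initials}".strip())
        mesh = [d.text for d in article.iter("DescriptorName")]
        records.append({"pmid": pmid, "authors": authors, "mesh": mesh})
    return records


records = parse_citations(SAMPLE_XML)
for record in records:
    print(record["pmid"], record["authors"], record["mesh"])
```

Each resulting record carries the fields from which nodes (for example, papers or authors) and edges (for example, shared MeSH terms) can later be derived.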
Previous work has been done to augment or refine the standard PubMed search, including tools to conduct combinatorial searches and to navigate standard search results based on common MeSH terms, gene names found in abstracts [6, 7], PubMed-assigned 'related articles', and combinations thereof [9–12]. In PubNet we present a unique two-pronged approach in which network graphs are dynamically rendered to provide an intuitive and complete view of search results, while hyperlinking to a textual representation to allow detailed exploration of a point of interest. Multiple simultaneous queries are also supported, greatly increasing the number and types of relationships that can be visualized. The PubNet server, source code, and gallery are available on the worldwide web.
How PubNet works and interpreting the output
Generally speaking, subsets of nodes that are highly connected are drawn together in tight clusters, whereas sparsely connected nodes are spread further apart. If two queries are entered, then the degree to which the two colors overlap on the graph can also be significant. These relationships can be compared quantitatively by exporting the network to TopNet, which calculates average degree, clustering coefficient, characteristic path length, and diameter for any network. TopNet scores a PubNet network automatically when the user clicks the 'Export to TopNet' icon below any PubNet query result.
Hyperlinks to a textual representation of every graph are provided on its results page. The textual representation provides a summary list of all nodes and edges that comprise the network. Each entry in the summary is a hyperlink to a detailed description. For nodes, a list of outgoing edges as well as a list of all connected neighbors and their respective edges are shown, with common edges highlighted. Relevant external databank links are also provided at the top of the page. The detailed view of an edge shows a list of all nodes connected by that edge. Note that in the SVG graphical format each node is also a hyperlink to its entry in the text version of the network, which allows one to navigate quickly from an interesting region in the graph to a detailed description of its components.
Structural genomics centers attempt to solve structures at very high throughput, and each center has its own unique approach to accomplish this task. Because the Protein Structure Initiative (PSI) is still in its pilot stages, it remains to be determined which approach is the most successful. Here we show how organizational, geographic, and social patterns of large collaborative research efforts are reflected in their publications.
Collaborative organization of single consortium
We begin by illustrating the types of relationships that can be extracted from a single query (Figure 3). A query consisting of a list of all Northeast Structural Genomics Consortium (NESG) PubMed IDs was analyzed using four different combinations of node and edge types, and each yielded strikingly different graph structures. Depending on the parameters that were specified to generate the graph, the linkages between nodes may correspond to similarity between papers, frequency of copublication between two authors (for a given query), common geographic sources for publications, and so on. The scalable vector graphics formats supported by PubNet allow one to zoom in on specific regions in the graph. Each node in the graph image is hyperlinked to a detailed textual report, which includes a hyperlinked list of all outgoing edges and a list of all neighboring nodes with their respective edges. Thus, starting directly from the graphical output, it is possible to explore specific node-edge linkages in detail.
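To make the idea of linkage by paper similarity concrete, the sketch below treats papers as nodes and draws an edge between two papers whenever they share at least one MeSH term. The paper identifiers, term sets, and the function name `shared_term_edges` are hypothetical and serve only to illustrate one possible node and edge combination; the rule PubNet actually applies may differ.

```python
from itertools import combinations

# Hypothetical records: paper identifier -> set of MeSH terms (illustrative only).
papers = {
    "P1": {"Proteins", "Crystallography, X-Ray", "Genomics"},
    "P2": {"Proteins", "Genomics"},
    "P3": {"Nuclear Magnetic Resonance", "Proteins"},
    "P4": {"Software"},
}


def shared_term_edges(records):
    """Link every pair of papers that shares at least one term.

    Returns a dict mapping (paper_a, paper_b) to the set of shared terms.
    """
    edges = {}
    for a, b in combinations(sorted(records), 2):
        common = records[a] & records[b]
        if common:
            edges[(a, b)] = common
    return edges


edges = shared_term_edges(papers)
for (a, b), common in edges.items():
    print(a, b, sorted(common))
```

In this toy input, P4 shares no term with any other paper and therefore remains an isolated node, whereas P1 and P2 are joined by two common terms.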
In the graph shown for the NESG consortium in Figure 3b, nodes are authors (researchers) and edges represent co-authorships on publications. The graph demonstrates the confederated but coordinated approach used by the NESG consortium, which includes two protein sample production centers, at least six different sites at which three-dimensional structures are determined by nuclear magnetic resonance or X-ray crystallography, and a loosely coupled group of some dozen laboratories working on various aspects of the technology development and annotation.
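A co-authorship network of this kind can be sketched as follows. The author lists and the function name `coauthor_edges` are hypothetical; the sketch assumes that an edge is weighted by the number of publications that two authors share, which is one plausible reading of the frequency of copublication.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per publication (illustrative only).
author_lists = [
    ["Smith J", "Lee K", "Wong A"],
    ["Smith J", "Lee K"],
    ["Wong A", "Patel R"],
]


def coauthor_edges(publications):
    """Count how many publications each pair of authors shares."""
    weights = Counter()
    for authors in publications:
        # Sort so that each pair is stored under one canonical ordering.
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


weights = coauthor_edges(author_lists)
for pair, count in sorted(weights.items()):
    print(pair, count)
```

Authors who publish together repeatedly accumulate heavier edges, which is the property that draws tightly collaborating groups into visible clusters.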
Comparison of several consortia
We also compare the publication authorship patterns of each of the PSI centers in Figure 4, using nodes to represent authors and edges to represent co-authorship. Because a single set of parameters was used across multiple queries, the underlying relationships between nodes are identical for each graph, and so differing graph structures correspond to variations in the global structure of these relationships. A diverse array of graph structures is evident, highlighting significant differences in size, frequency in publication, and degree of cooperation across the consortia. For example, the Tuberculosis Structural Genomics consortium conducts its experiments in small separate groups, whereas the Joint Center for Structural Genomics uses a more centralized approach. Groups such as the NESG and New York Structural Genomics Research Consortium employ an intermediate approach, in which central groups are tightly clustered but also linked to other groups in a collaborative pipeline.
A simple example with Protein Data Bank IDs
Evaluating the output of the Protein Structure Initiative
To compare PSI structures with general PDB structures, two types of graphs were generated. First, a two-query graph was generated with all available PSI structure associated PubMed IDs comprising the first query, and a random set of 300 PDB IDs comprising the second query (Figure 6a). The second type of graph was generated by running two random sets of 300 PDB IDs against each other (Figure 6b).
We have observed that differing patterns in PubNet graphs among ostensibly similar queries can reveal underlying differences derived from the content of the publications returned by each query. Major features that can vary include the degree of aggregation of nodes into different clusters (roughly indicating the subject of the protein structure) and the balance of blue and yellow nodes within the various clusters. If PSI structure publications are indistinguishable from random PDB structure publications, then we would expect the graphs based on PSI structure publications versus random PDB structure publications to have a similar character to graphs based on two random sets of PDB structures. However, as shown in Figure 6a, the PSI structure publication nodes do not intersperse with regular PDB structure nodes as much as do the nodes of two sets of random structures. The PSI nodes clearly tend to aggregate in tighter neighborhoods than do the other nodes. Although this is by no means definitive, the differential clustering might indicate some underlying differences between the PSI structures and random PDB structures. One obvious source of difference in the structure publications is the fact that many PSI structures are un-annotated 'hypothetical proteins', and so they lack the MeSH terms required for greater dispersal. Another factor might be that similar methods are used to determine PSI structures, and this is reflected in their publications.
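One simple way to put a number on how well the nodes of two queries intersperse is the fraction of edges that join nodes from different queries. This measure is an assumption introduced here for illustration; it is not a statistic reported by PubNet or TopNet, and the graph and the function name `cross_query_fraction` below are hypothetical.

```python
# Hypothetical two-query graph: node -> query color, plus an undirected edge list.
color = {"a": "blue", "b": "blue", "c": "yellow", "d": "yellow", "e": "blue"}
edges = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "e"), ("d", "e")]


def cross_query_fraction(edge_list, node_color):
    """Fraction of edges whose two endpoints come from different queries."""
    if not edge_list:
        return 0.0
    mixed = sum(1 for u, v in edge_list if node_color[u] != node_color[v])
    return mixed / len(edge_list)


fraction = cross_query_fraction(edges, color)
print(fraction)
```

Under this measure, two well-mixed queries would yield a comparatively high fraction, whereas queries whose nodes aggregate in separate neighborhoods would yield a low one.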
Assessing results with TopNet
In addition to examining the textual representation of the graph, qualitative assessments of the network structure can be verified by exporting the results of any PubNet query to TopNet. One particularly useful descriptor is the average degree of a network, which is the average of the degrees of each node. In a PubNet graph, node degrees increase with more common edge terms between the nodes. A high average degree indicates that the nodes are highly connected to each other. Note that the utility of many topological descriptors depends on the connectedness of a graph. For a more detailed explanation of descriptors, see the report by Yu and coworkers.
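The four descriptors named in this paper can be sketched on a small graph as shown below. This is a minimal illustration under stated assumptions, not TopNet's implementation: the clustering coefficient is taken as the mean of the local coefficients of all nodes, the characteristic path length as the mean shortest-path length over all node pairs, and the graph is assumed to be undirected and connected. The example graph and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected, connected graph as an adjacency mapping.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def average_degree(g):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in g.values()) / len(g)


def clustering_coefficient(g):
    """Mean over nodes of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for nbrs in g.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for u, v in combinations(nbrs, 2) if v in g[u])
        total += 2 * links / (k * (k - 1))
    return total / len(g)


def bfs_distances(g, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(g):
    """Return (characteristic path length, diameter) for a connected graph."""
    lengths = []
    for source in g:
        dist = bfs_distances(g, source)
        if len(dist) != len(g):
            raise ValueError("graph is not connected")
        lengths.extend(d for node, d in dist.items() if node != source)
    return sum(lengths) / len(lengths), max(lengths)


cpl, diameter = path_statistics(graph)
print(average_degree(graph), clustering_coefficient(graph), cpl, diameter)
```

The explicit connectivity check reflects the caveat above: path-based descriptors are not well defined on a disconnected graph without a further convention, so the sketch raises an error rather than returning a misleading value.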
TopNet comparison of several networks (rows): Figure 4 (NESG IDs); Figure 4 (JCSG IDs); Figure 4 (TB IDs); Figure 5a (DNA pol IDs); Figure 5b (RNA pol IDs); Figure 6a (PSI IDs); Figure 6b (PDB IDs 1); Figure 6b (PDB IDs 2)
In this paper we present PubNet, a web tool that can be used to extract and visualize a variety of relationships between publications indexed by PubMed. Distinguishing features of PubNet include its ability to generate several different types of graphs based on a single query and to accommodate two queries simultaneously, which greatly facilitates graph comparison. The basic functionality of PubNet is demonstrated by its application to publications derived from the PSI, which revealed a diverse array of collaborative patterns in the different research centers as well as increased similarity between primary citations associated with PSI structures relative to a random sample of PDB structure citations. It is unclear whether these differences will remain once the PSI structures are properly annotated.
By focusing on PSI publications we offer only a small glimpse of the possible uses of PubNet. Although only 15 combinations of node and edge parameters are currently supported, the number of different queries that can be entered is unrestricted. We have included a 'save' feature that permanently links any PubNet graph to a user gallery, and we invite the community to submit queries and comments.
We thank Dr Janet Huang for useful discussions and for her efforts in prototyping the PubNet graphs. This work is supported by grant P50-GM62413 from the Protein Structure Initiative of the National Institutes of Health, Institute of General Medical Sciences.
- Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE: Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998, 54: 1078-1084. 10.1107/S0907444998009378.
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31: 365-370. 10.1093/nar/gkg095.
- Entrez PubMed. [http://www.ncbi.nlm.nih.gov/entrez/]
- Becker KG, Hosack DA, Dennis GJ, Lempicki RA, Bright TJ, Cheadle C, Engel J: PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics. 2003, 4: 61. 10.1186/1471-2105-4-61.
- Srinivasan P: MeSHmap: a text mining tool for MEDLINE. Proc AMIA Symp. 2001, 642-646.
- Andrade MA, Valencia A: Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 25-32.
- Hoffmann R, Valencia A: A gene network for navigating the literature. Nat Genet. 2004, 36: 664. 10.1038/ng0704-664.
- HubMed. [http://www.hubmed.org/]
- Chen H, Sharp BM: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics. 2004, 5: 147. 10.1186/1471-2105-5-147.
- ClusterMed. [http://clustermed.info/]
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28: 21-28. 10.1038/88213.
- Perez-Iratxeta C, Perez AJ, Bork P, Andrade MA: Update on XplorMed: a web server for exploring scientific literature. Nucleic Acids Res. 2003, 31: 3866-3868. 10.1093/nar/gkg538.
- PubNet. [http://pubnet.gersteinlab.org/]
- aiSee. [http://www.aisee.com/]
- Yu H, Zhu X, Greenbaum D, Karro J, Gerstein M: TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res. 2004, 32: 328-337. 10.1093/nar/gkh164.
- Protein Structure Initiative. [http://www.nigms.nih.gov/psi/]
- Rupp B, Segelke BW, Krupka HI, Lekin T, Schafer J, Zemla A, Toppani D, Snell G, Earnest T: The TB structural genomics consortium crystallization facility: towards automation from protein to electron density. Acta Crystallogr D Biol Crystallogr. 2002, 58: 1514-1518. 10.1107/S0907444902014282.
- Lesley SA, Kuhn P, Godzik A, Deacon AM, Mathews I, Kreusch A, Spraggon G, Klock HE, McMullan D, Shin T, et al: Structural genomics of the Thermotoga maritima proteome implemented in a high-throughput structure determination pipeline. Proc Natl Acad Sci USA. 2002, 99: 11664-11669. 10.1073/pnas.142413399.
- Acton TB, Gunsalus K, Xiao R, Ma L, Aramini J, Baran MC, Chiang Y, Climent T, Cooper B, Denissova N, et al: Robotic cloning and Protein Production Platform of the Northeast Structural Genomics Consortium. Methods Enzymol. 2005, 394: 210-243.
- Chance MR, Fiser A, Sali A, Pieper U, Eswar N, Xu G, Fajardo JE, Radhakannan T, Marinkovic N: High-throughput computational and experimental techniques in structural genomics. Genome Res. 2004, 14: 2145-2154. 10.1101/gr.2537904.
- Chen L, Oughtred R, Berman HM, Westbrook J: TargetDB: a target registration database for structural genomics projects. Bioinformatics. 2004, 20: 2860-2862. 10.1093/bioinformatics/bth300.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.