Power-law-like distributions in biomedical publications and research funding
© BioMed Central Ltd 2007
Published: 30 April 2007
Gene annotation, as measured by links to the biomedical literature and funded grants, is governed by a power law, indicating that researchers favor the extensive study of relatively few genes. This emphasizes the need for data-driven science to accomplish genome-wide gene annotation.
Following the completion of the primary sequence of the mouse and human genomes, one of the key challenges for the biomedical community is the functional annotation of all genes. With more than 650,000 citations indexed in Medline in 2005 alone, it is tempting to assume that our understanding of gene function is steadily and uniformly progressing.
Evidence of power-law relationships has been observed in many aspects of biology and natural systems, including populations in cities, metabolic networks, protein-protein interactions, and the topology of the Internet (see, for example, [3–5]). The observation of this pattern in the biomedical literature probably reflects an underlying natural principle. Researchers studying scale-free networks showed that a power-law relationship in the connectivity of nodes was a consequence of new nodes being preferentially attached to well-connected nodes. In information science, this has been termed the 'principle of least effort', and we suggest that the power law manifests itself here because researchers naturally tend to study what is easy to study, namely genes that have already been studied.
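The preferential-attachment argument can be illustrated with a small simulation. The sketch below is not the analysis performed in this article; it is a toy Simon/Yule-style process in which each new paper either introduces a previously unstudied gene or cites an existing gene with probability proportional to that gene's current citation count. The function name, the parameter `p_new_gene`, and all numeric values are assumptions chosen for illustration only.

```python
import random


def simulate_gene_citations(n_papers=50000, p_new_gene=0.05, seed=0):
    """Toy 'rich get richer' model of gene citations (illustrative only).

    Returns a list of citation counts, one per gene.
    """
    rng = random.Random(seed)
    # One entry per citation event; sampling uniformly from this list
    # selects a gene with probability proportional to its citation count.
    cited_genes = [0]
    n_genes = 1
    for _ in range(n_papers - 1):
        if rng.random() < p_new_gene:
            gene = n_genes          # a previously unstudied gene enters
            n_genes += 1
        else:
            gene = rng.choice(cited_genes)  # favor well-studied genes
        cited_genes.append(gene)

    counts = [0] * n_genes
    for gene in cited_genes:
        counts[gene] += 1
    return counts


def top_fraction_share(counts, fraction=0.1):
    """Share of all citations held by the most-cited `fraction` of genes."""
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


if __name__ == "__main__":
    counts = simulate_gene_citations()
    print("genes:", len(counts))
    print("share of citations held by top 10% of genes:",
          round(top_fraction_share(counts), 3))
```

Under these assumptions, a small fraction of genes accumulates most of the citations while the typical gene is cited only once or twice, which is the qualitative pattern described above.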
If the pattern of citations in the biomedical literature accurately reflects historical patterns of research, then an analysis of recent grants funded by the National Institutes of Health (NIH) will probably reveal future trends. We therefore examined the CRISP database for all grants funded by the NIH in 2005. Because grants are not indexed by gene name, we identified CRISP keywords that correspond to gene names through manual curation and comparison with Entrez Gene. Although fewer gene keywords were identified, which resulted in a noisier picture, we again found that the number of grant citations per gene decays according to a power law (a = 0.39) (Figure 1c). Similar analyses that used keyword searches of grant abstracts and investigator-initiated (R01) grant information from 2003 and 2004 all gave qualitatively similar results.
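The text does not state how the exponent a was estimated. One common and simple approach, shown here only as a sketch, is a least-squares fit of log(count) against log(rank), assuming that counts follow count ≈ C · rank^(−a). The function name and the synthetic input are hypothetical; the data are generated, not taken from CRISP or Entrez Gene.

```python
import math


def fit_power_law_exponent(counts):
    """Estimate a in count ~ C * rank**(-a) by least squares on log-log axes.

    `counts` is any iterable of per-gene citation counts; zero counts are
    dropped because their logarithm is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # decay exponent is the negative of the log-log slope


if __name__ == "__main__":
    # Synthetic, noise-free counts generated with a = 0.39 (illustrative only).
    synthetic = [1000.0 * rank ** -0.39 for rank in range(1, 501)]
    print("estimated a:", fit_power_law_exponent(synthetic))
```

A log-log regression is easy to inspect against a plot such as Figure 1c, but it can be biased on noisy count data; maximum-likelihood estimators are generally preferred when a precise exponent is needed.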
Understanding the function of all the genes in the mammalian genome is a goal shared by researchers and funding agencies alike. Success in achieving this goal will require concerted efforts to fight the power law and the principle of least effort. Specifically, these efforts will require the transformation of the observed power-law-like distributions to something that better approximates a normal distribution (or, more precisely, a gamma distribution as shown in Figure 1d). This ideal distribution would indicate that the majority of genes have some minimal non-zero degree of gene annotation, with tails that extend in both directions. Recent progress in data-driven research and ongoing advances in genome-scale gene annotation are important steps toward achieving this transformation. These emerging techniques include gene and protein expression analysis, protein-protein interactions, and high-throughput screening using overexpression and RNA interference methodologies. Historically unbiased methods such as genetics will also contribute as candidate genomic loci are refined to the resolution of individual genes.
In summary, we have shown power-law-like distributions in gene annotation (measured by links to the biomedical literature) and research funding (measured by gene references in funded grants). This shows that the research community is still far from understanding the function of all mammalian genes, and instead focuses most of its effort on relatively few. While recent advances in data-driven and genome-scale research are promising, recognition of this phenomenon and a dramatic shift in the pattern of both scientific publishing and funding will be required for our goal of genome-wide gene annotation to be realized.
- Collins FS, Green ED, Guttmacher AE, Guyer MS, US National Genome Research Institute: A vision for the future of genomics research. Nature. 2003, 422: 835-847. 10.1038/nature01626.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005, 33 (Database issue): D54-D58.
- Zipf GK: Human Behavior and the Principle of Least Effort. 1949, Cambridge, MA: Addison-Wesley
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science. 2002, 297: 1551-1555. 10.1126/science.1073374.
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286: 509-512. 10.1126/science.286.5439.509.
- Mann T: A Guide to Library Research Methods. 1987, New York, NY: Oxford University Press
- CRISP. [http://crisp.cit.nih.gov]