Mining microarray expression data by literature profiling
© Chaussabel and Sher, licensee BioMed Central Ltd 2002
Received: 18 January 2002
Accepted: 18 July 2002
Published: 13 September 2002
The rapidly expanding fields of genomics and proteomics have prompted the development of computational methods for managing, analyzing and visualizing expression data derived from microarray screening. Nevertheless, the lack of efficient techniques for assessing the biological implications of gene-expression data remains an important obstacle in exploiting this information.
To address this need, we have developed a mining technique based on the analysis of literature profiles generated by extracting the frequencies of certain terms from thousands of abstracts stored in the Medline literature database. Terms are then filtered on the basis of both repetitive occurrence and co-occurrence among multiple gene entries. Finally, clustering analysis is performed on the retained frequency values, shaping a coherent picture of the functional relationship among large and heterogeneous lists of genes. Such data treatment also provides information on the nature and pertinence of the associations that were formed.
The analysis of patterns of term occurrence in abstracts constitutes a means of exploring the biological significance of large and heterogeneous lists of genes. This approach should contribute to optimizing the exploitation of microarray technologies by providing investigators with an interface between complex expression data and large literature resources.
Microarray technologies provide the means of measuring the expression of thousands of genes or proteins simultaneously. This revolution brings new perspectives for the study of expression networks and their regulation, potentially providing valuable insights into the molecular mechanisms underlying disease. Increasingly accessible microarray platforms allow the unrestrained and rapid generation of large expression datasets. As large volumes of data are being generated, the need for data-mining programs that provide the means to manage, normalize, filter, group and visualize expression data expands. These tools help to identify subsets of genes whose expression changes significantly and organize them according to their expression profiles. Although necessary, this type of analysis does not reveal the biological implications encrypted in expression data. Indeed, the evaluation of the functional significance of large, heterogeneous and noisy groups of genes constitutes the real challenge for microarray users.
A further problem is that the wealth of knowledge accumulated after decades of biological research has resulted in a considerable narrowing of research fields. As a consequence, in-depth knowledge of gene function possessed by highly specialized investigators is biased and limited to relatively small subsets of genes that become the focus of the expression-data analysis. The definition of functional classes and improved access to information associated with individual genes partly makes up for this lack of perspective. However, information about gene function is primarily contained in the 11 million articles indexed in the Medline database. Evaluating the functional associations that might exist among large groups of genes from this huge volume of literature is not feasible in a time frame compatible with the pace at which the data can be generated. Limitations in our capacity to explore the functional dimension of microarray expression are one of the major impediments to the optimal exploitation of this powerful technology. Surprisingly, only a few groups have previously addressed this shortcoming [3,4,5].
We describe here how a literature-derived term frequency database can be generated and mined through the analysis of patterns of occurrences of a restricted subset of relevant terms. This 'literature profiling' produces a coherent picture of the functional relationships among large and heterogeneous lists of genes and should enable the development of tools for rapidly extracting meaningful knowledge from large microarray expression databases.
Results and discussion
The method requires articles related to each of the genes included in the analysis to be extracted. This is done by querying the Medline database through PubMed using appropriate search strings. We chose to retrieve entries containing the official gene name, abbreviation or aliases in the title field. Information about gene nomenclature can be found on the website of the Human Gene Nomenclature Committee (HGNC). Using this source we created a database containing URLs in the PubMed query format for the more than 10,500 known human genes defined by HGNC (for example, for protein kinase C eta: the URL found in the database is http://www3.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=0&form=1&term=PRKCH+%5Bti%5D+OR+PKC-L+%5Bti%5D+OR+PRKCL+%5Bti%5D+OR+protein%20kinase%20C%20eta+%5Bti%5D; pointing a web browser to this address gives the 17 entries that would have been retrieved by typing the following search string: 'PRKCH [ti] OR PKC-L [ti] OR PRKCL [ti] OR protein kinase C eta [ti]'). URL entries are indexed by GenBank and LocusLink IDs and can be downloaded as a Microsoft Excel table (see Additional data files). The search for relevant literature for each individual gene is complicated by the fact that the same gene can have many different names associated with it and that the same name or abbreviation can have different meanings. A rapid scanning of the search results is useful for the identification and removal of inappropriate search strings (see below).
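The construction of a title-restricted search string from a gene's official name, abbreviation and aliases can be sketched as follows. This is a minimal illustration and not necessarily the procedure used to build the URL database: the function names are hypothetical, and the percent-encoding produced by Python's standard library differs slightly from the URL quoted above (spaces are encoded as '+').

```python
from urllib.parse import quote_plus


def build_search_string(names):
    """Join gene names and aliases into a title-restricted search string."""
    return " OR ".join(f"{name} [ti]" for name in names)


def encode_term(search_string):
    """Percent-encode the search string for use as a URL query parameter."""
    return quote_plus(search_string)


# Names and aliases for protein kinase C eta, as given in the text.
aliases = ["PRKCH", "PKC-L", "PRKCL", "protein kinase C eta"]
query = build_search_string(aliases)
print(query)
print(encode_term(query))
```

Keeping the list of aliases separate from the encoded string makes the gene-by-gene editing of ambiguous names or short acronyms easier, because an alias can be dropped before the query is rebuilt.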
For each gene, the result of the query is downloaded in XML format. Abstracts are then extracted from the file by means of a macro running on Microsoft Excel and saved as a new file to be used for text analysis.
Methodologies described in this report were tested on a list of 70 genes (see Additional data files) derived from a subset of conditions belonging to a sample gene-expression dataset generated to study the transcriptional response of professional antigen-presenting cells to pathogens using high-density oligonucleotide arrays (D. Chaussabel, R. Semnani, M. Mcdowell, D. Sacks, A. Sher and T.B. Nutman, unpublished observations). We were able to find at least five relevant records in the Medline database containing abstracts for 44 out of the 70 genes listed. Another 10 genes had at least five records with accompanying abstracts when their generic name was used as a search string (for example, 'interferon induced transmembrane protein' instead of 'interferon induced transmembrane protein 1').
Word occurrence in abstracts is determined for each gene by analyzing the contents of Medline entries (nearly 4,000 in the example presented here). This parameter describes the relative frequency of abstracts containing a given word (for example, 18.2% of the abstracts indexed for the gene GADD45B contain the word 'proliferation').
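As an illustration of this parameter, the sketch below computes, for one gene, the percentage of abstracts that contain each word. It is a simplified stand-in for the text-analysis software used in this work; the tokenization rule, the function name and the toy abstracts are assumptions made for the example.

```python
import re
from collections import Counter


def term_occurrence(abstracts):
    """Return, for each word, the percentage of abstracts that contain it."""
    counts = Counter()
    for text in abstracts:
        # A set ensures each abstract contributes at most once per word.
        words = set(re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower()))
        counts.update(words)
    total = len(abstracts)
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}


# Toy abstracts invented for illustration (not real Medline records).
toy_abstracts = [
    "GADD45B regulates proliferation and apoptosis.",
    "Expression is induced by DNA damage.",
    "Proliferation is reduced after induction.",
    "The gene is induced during stress.",
]
occurrence = term_occurrence(toy_abstracts)
print(occurrence["proliferation"])  # 50.0 (2 of 4 abstracts)
```

Note that the value counts abstracts, not word instances: a word repeated many times within a single abstract still contributes only once.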
Occurrence values are assigned to every unique word found in the literature analyzed, resulting in tens of thousands of entries for each gene. A vast majority of these terms are either found ubiquitously (for example, 'if', 'because', 'cell', 'identified' are present in most abstracts of most genes) or very rarely (present in very few abstracts of few genes) and therefore are of very little use for the definition of gene-specific term occurrence profiles. However, a third category of terms can be found in most abstracts of very few genes and convey relevant information about these genes. These terms are characterized both by high occurrence values in gene-specific collections of abstracts and a low baseline occurrence in the literature.
Term selection by filtering
Occurrence in abstracts
List of genes used to illustrate the technique and their abbreviations
ATP-binding cassette, subfamily B (MDR/TAP), member 2
Adenylate kinase 3
Baculoviral IAP repeat-containing 3
CASP8 and FADD-like apoptosis regulator
Dual specificity phosphatase 1
Dual specificity phosphatase 4
Dual specificity phosphatase 5
Interferon, alpha-inducible protein (clone IFI-6-16)
Growth arrest and DNA-damage-inducible, alpha
Growth arrest and DNA-damage-inducible, beta
Guanylate binding protein 1, interferon-inducible, 67kD
GTP cyclohydrolase 1
H2A histone family, member O
Major histocompatibility complex, class I, F
Interferon-induced protein with tetratricopeptide repeats
Interferon induced transmembrane protein
Interleukin 15 receptor
Interleukin 7 receptor
Interferon induced protein 10
Interferon induced protein 9
Interferon regulatory factor 4
Interferon regulatory factor 7
Interferon-stimulated protein, 15 kDa
Interferon stimulated gene (20kD)
Monocyte chemotactic protein 2
Monocyte chemotactic protein 3
Monokine induced by gamma interferon
Macrophage inflammatory protein 3 alpha
Matrix metalloproteinase 9
Metallothionein 1A (functional)
Myxovirus (influenza) resistance 1
Myxovirus (influenza) resistance 2
Nuclear factor kappaB 1 (p105)
Nuclear factor kappaB2 (p49/p100)
Nuclear factor kappaB inhibitor, alpha
Nuclear receptor subfamily 4, group A, member 3 (NOR1)
Phosphodiesterase 4B, cAMP-specific
Proteasome (prosome, macropain) subunit, alpha
Proteasome activator subunit 2 (PA28)
Protein tyrosine phosphatase 1B
Superoxide dismutase 2, mitochondrial
Signal transducer and activator of transcription 1, 91kD
Signal transducer and activator of transcription 4
Tumor necrosis factor, alpha-induced protein 3
Tumor necrosis factor, alpha-induced protein 6
TNF receptor-associated factor 1
Vascular endothelial growth factor
It is notable that the functional groups identified in this list of genes significantly induced after infection of professional antigen-presenting cells are related to immune responses. Genes for transcription factors that control inflammatory responses and programmed cell death make up the first gene cluster considered (Figure 2, color coded in blue). These genes have abstracts with a frequent occurrence of terms such as 'TNF' (the inflammatory mediator tumor necrosis factor), 'death' or 'apoptosis'. The largest group is composed of genes associated with the term 'interferon' (also 'IFN' and 'IFN-alpha', color coded green, Figure 2); indeed, STATs are factors specifically required for interferon signaling. Interferon regulatory factors (IRFs) trigger the interferon response, whereas other members of the group are effector antiviral molecules (for example, ISG15, ISG20) sometimes associated with terms such as 'virus', 'infected' or 'infection' (OAS, Mx1, Mx2). The next group (Figure 2, red) is composed exclusively of chemokines. Interestingly, the analysis of abstract contents was able to distinguish monokines belonging to the CXCR family (SCYB chemokines: IP-9, IP-10, MIG; associated with 'CXC', 'CXCR', 'monokine' or 'MIG') from CC chemokines (SCYA chemokines: LARC, RANTES, MCP2, MCP3). The last group (Figure 2, violet) is composed of genes involved throughout the MHC class I antigen-presentation pathway. Specifically, these genes encode proteins involved in the degradation of proteins into peptides by the immunoproteasome (PSMAs, PSME), antigenic peptide loading and transport (ABCB2 also known as TAP1, for transporter associated with antigen processing 1) and presentation at the cell surface (HLA-F, B2M). It is notable that one of the closest pairs formed consists of a receptor-ligand pair: VEGF and NRP2 (Figure 3). 
Overall, these examples illustrate the concept that appropriate terms taken out of context can still convey valuable information and can be used to rapidly explore and assess the biological meaning of complex datasets.
Analyzing patterns of term occurrence in groups of genes with different degrees of association
Conditions for the formation of 'meaningful' gene clusters
To exclude the possibility that groups of meaningful genes may arise by using a sufficient number of co-occurring terms, we permuted term-occurrence values for each gene before clustering (Figure 5c). The fact that this treatment results in a complete loss of the original hierarchy proves that the formation of meaningful groups of genes cannot be attributed to a clustering artifact.
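The permutation control can be sketched as follows: each gene keeps its own set of occurrence values, but the values are reassigned to terms at random, which breaks any term-level agreement between genes. This is a minimal sketch; the function name, the dictionary layout and the toy values are assumptions for illustration.

```python
import random


def permute_profiles(profiles, seed=None):
    """Shuffle the term-occurrence values of each gene independently."""
    rng = random.Random(seed)
    shuffled = {}
    for gene, values in profiles.items():
        row = list(values)  # copy so the original profile is left untouched
        rng.shuffle(row)
        shuffled[gene] = row
    return shuffled


# Toy occurrence values (fractions of abstracts), invented for illustration.
profiles = {
    "GENE_A": [0.8, 0.7, 0.0, 0.1],
    "GENE_B": [0.7, 0.9, 0.1, 0.0],
}
permuted = permute_profiles(profiles, seed=0)
print(permuted)
```

Clustering the permuted table with the same settings as the original one then shows whether the observed groups depend on which terms the genes share or merely on the number of retained values.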
Literature profiling of large gene lists
The size of the list of genes that must be analyzed can vary greatly from one microarray experiment to another. In an ideal setting, the analysis of gene-expression patterns groups co-regulated genes into small subsets. In most cases, however, partitioning of the data on the basis of expression is impaired by a small number of conditions or straightforward expression profiles. As a consequence, microarray experiments often generate lists of several hundred genes for which biological meaning must be sought. The use of a mining technique such as the one described here will be most valuable in this context. In this section we give two examples of literature profiles generated from published datasets.
When a large number of genes are analyzed, the level of noise (less-specific terms) can be more important, and filtering criteria were adjusted accordingly. The fixed 25% cut-off we used in our previous example can be too high for a gene represented by hundreds of abstracts but can also be relatively low when considering a gene for which only five abstracts could be retrieved. To take such discrepancies into account we optimized the cut-off for each gene as follows: cut-off = t + (k/n) where t is the minimum threshold, k is a constant and n is the number of abstracts retrieved for a given gene; t and k must be set arbitrarily and will directly influence resolution and noise levels. For these examples we chose t = 15% and k = 1.5, therefore cut-off values for genes represented by 5 or 100 abstracts are 45% and 16.5% respectively. The gene-term specificity was further improved by adding a filter that removes terms present in the vocabulary of more than half of the genes considered (for example, 'bound', 'contained', 'clones', 'putative', 'process'). Such a filter is particularly appropriate for large datasets, as the chance of less-specific terms being retained by other filters increases with the number of genes analyzed. The functional heterogeneity inherent in large gene lists eliminates the risk of relevant terms being removed by this filter. Similar themes were identified when the cut-off applied in the previous example is used instead. However, increasing the stringency of the filter resulted in a tighter clustering of large datasets. In these examples we eliminated redundant singular/plural forms by averaging term-occurrence values derived from both entries (considering, for instance, 'lipoprotein' and 'lipoproteins' as a single entity).
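The filtering steps described in this paragraph can be sketched as follows, with occurrence values expressed as fractions (so that t = 0.15 and k = 1.5 give 0.45 and 0.165 for 5 and 100 abstracts). This is a minimal sketch and not the spreadsheet actually used: the function names and toy values are hypothetical, plural forms are detected with a simple trailing-'s' rule, and a term is assumed to belong to a gene's vocabulary when its occurrence is non-zero.

```python
def cutoff(n_abstracts, t=0.15, k=1.5):
    """Gene-specific cut-off t + k/n, as a fraction of abstracts."""
    return t + k / n_abstracts


def merge_plurals(profile):
    """Average the values of a term and its simple '-s' plural into one entry."""
    merged = dict(profile)
    for term in sorted(profile, key=len):
        plural = term + "s"
        if term in merged and plural in merged:
            merged[term] = (merged[term] + merged.pop(plural)) / 2
    return merged


def filter_terms(profiles, n_abstracts, t=0.15, k=1.5):
    """Apply the per-gene cut-off and drop terms found in more than half of the genes."""
    merged = {gene: merge_plurals(p) for gene, p in profiles.items()}
    # Count the genes in whose vocabulary (non-zero occurrence) each term appears.
    gene_count = {}
    for profile in merged.values():
        for term, value in profile.items():
            if value > 0:
                gene_count[term] = gene_count.get(term, 0) + 1
    too_common = {term for term, c in gene_count.items() if c > len(merged) / 2}
    result = {}
    for gene, profile in merged.items():
        limit = cutoff(n_abstracts[gene], t, k)
        result[gene] = {term: v for term, v in profile.items()
                        if v >= limit and term not in too_common}
    return result


# Toy occurrence values, invented for illustration.
profiles = {
    "LPL": {"lipoprotein": 0.60, "lipoproteins": 0.40, "putative": 0.50},
    "LDLR": {"lipoprotein": 0.55, "cholesterol": 0.70, "putative": 0.30},
    "MX1": {"virus": 0.65, "putative": 0.20},
    "MX2": {"virus": 0.50, "infection": 0.10},
}
n_abstracts = {"LPL": 100, "LDLR": 100, "MX1": 5, "MX2": 5}
filtered = filter_terms(profiles, n_abstracts)
print(filtered)
```

In this toy example 'putative' is removed because it appears in three of the four genes, and 'infection' is removed for MX2 because 0.10 falls below the 0.45 cut-off that applies to a gene with only five abstracts.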
Several groups of genes involved in different aspects of the immune response to an infection were uncovered by literature profiling (interferon response, chemotaxis, inflammation: Figure 6d,6f, and 6g, respectively). But many other functional groups were also identified as follows.
As indicated by their names, lipoprotein lipase (LPL) and low-density lipoprotein receptor (LDLR) genes are involved in lipid and cholesterol metabolism and were logically associated by literature profiling (Figure 6a). Interestingly, the CD36-like 1 antigen (alias thrombospondin receptor-like 1 - CD36L1) clustered tightly with LPL and LDLR and shared with these genes terms such as 'lipoprotein', 'lipid' or 'cholesterol' (Figure 6a). This association was validated by browsing the literature relevant to CD36L1, which contains reports showing the role of this molecule as a receptor for high-density lipoprotein.
The two major groups of proteinases involved in extracellular matrix degradation - serine proteinases and metalloproteinases - have been grouped by literature profiling (Figure 6b: urokinase plasminogen activation cascade (UPA, PLAUR, SERPIN) and matrix metalloproteinases (MMP14, MMP10, MMP1)). Both families are activated during inflammation and, as indicated by their literature profiles, are involved in tumor invasion and metastasis [13,14]. In the context of a bacterial infection these proteins enable activated macrophages to cross endothelial barriers and gain access to the site of the infection [15,16] (other terms shared by these genes are 'migration', 'vascular', 'endothelial'). An extracellular-matrix-binding protein, SPARC (secreted protein, acidic, cysteine-rich, alias osteonectin), was also associated with these proteinases by literature profiling. SPARC can increase endothelial permeability and is known to participate in tumor angiogenesis and extravasation. Interestingly, this protein has not been reported as being upregulated upon cell infection and its possible role in macrophage transendothelial migration was never addressed. This example illustrates how functional relationships that could not be deduced from gene names were uncovered through the analysis of patterns of term occurrence: matrix metalloproteinases (MMP1, 10, 14) and urokinase plasminogen activator (UPA, SPARC) are matrix-interacting molecules involved in tumor invasion and metastasis.
The cluster shown in Figure 6c is composed of members of two gene families: adenosine receptors (ADORA3 and ADORA2A) and purinergic receptors (P2RX1 and P2RX7). Indeed, although not evident from its name, P2RX acts as a receptor for a phosphorylated form of adenosine (adenosine triphosphate).
Another interesting example where non-obvious associations were revealed by literature profiling is shown in Figure 6e. This group consists of genes for which related abstracts have in common terms such as 'disorder', 'allele', 'recessive' or 'autosomal'. This shared vocabulary is indicative of an association that, given the diversity of genes implicated, would have undoubtedly been overlooked by the mere examination of the gene list. Indeed, a rapid search of the Online Mendelian Inheritance in Man database (OMIM) for genes associated with the terms 'severe' and 'disorder' confirmed that mutations of GALC, LAMB3, GJB2, JAG1, TGFBI, LPL and LDLR were the origin of serious disorders: Krabbe disease, Herlitz junctional epidermolysis bullosa, autosomal dominant deafness - Vohwinkel syndrome, Alagille syndrome, corneal dystrophy, type I hyperlipoproteinemia and hypercholesterolemia, respectively. In addition, two genes sharing a similar vocabulary could be found outside the region outlined in Figure 6e: GCDH (linked to glutaric acidemia type I) and MPI (linked to carbohydrate-deficient glycoprotein syndrome, type Ib).
Taken together, these examples demonstrate the power of the analysis of literature profiles in revealing unsuspected functional relationships in large and heterogeneous lists of genes.
Benefits and limitations
The mining technique we describe is designed to guide the interpretation of complex expression databases. Key aspects of the technique that contribute to this goal include the following:
- The method is independent of the user's knowledge of gene function and can therefore be used to identify promising findings rapidly in an unbiased way.
- The method renders the data intelligible by bringing functional coherence to large and heterogeneous lists of genes.
- The terms used as criteria to explore relationships among genes differ with the composition of the group of genes considered for analysis. Because the basis for classifying genes is flexible, associations made between them will change with the context in which they are found.
- The technique is based on the analysis of the content of scientific publications and constitutes a contemporary solution for the exploitation of swelling literature resources by providing investigators with leads for further in-depth investigation of the literature.
- Term-occurrence data derived from literature profiling can be used to annotate heterogeneous gene lists, thus adding to the value of this technique as a visualization tool (Figure 3).
The implementation of our mining technique as a computational tool is hindered by the need to retrieve the relevant literature reliably for each gene included in the analysis. Indeed, gene-by-gene editing of automatically generated PubMed query strings is often required to ensure low levels of false positives among the abstracts retrieved. Several names and abbreviations are often associated with a single gene but are used in a different context (for example, to designate drugs, bacterial strains or medical procedures), or they belong to the English vocabulary (for example, 'Wars' = 'tryptophanyl-tRNA synthetase', 'Sky' = 'TYRO3 protein tyrosine kinase', 'God' = 'Godzilla'). Short acronyms are especially problematic (for example, 'CT', the abbreviation for 'calcitonin', can be found in the title of over 20,000 abstracts, of which only 25 contain the term 'calcitonin'). Parsing issues that are caused by a confusing gene nomenclature can, however, be avoided when curated literature resources are available (for example, the Yeast Literature Database).
The reduction of the information contained in the literature is also limiting. Words taken out of their context convey useful but limited information, and this superficial assessment of the literature can only be used to direct further investigation. The selection of terms through rounds of filtering inevitably results in the selection of irrelevant terms (false positives), and pertinent terms will also be lost (false negatives). Reviewing the terms and literature that prompted the definition of relationships among genes can easily identify false-positive associations. False-negative associations are harder to identify and can only be kept to a minimum by combining existing approaches designed to assess the biological significance of large sets of genes. Like the other literature-mining approaches previously published, our technique cannot be expected to give definitive answers, but nonetheless provides investigators with much-needed solutions for the functional evaluation of complex microarray data.
Relatively few groups have attempted to resolve the bottleneck constituted by the inability of highly specialized investigators to assess the existence of relationships between genes in a high-throughput fashion [3,4,5]. Jenssen et al. analyzed literature contents to create a gene-to-gene co-citation network revealing associations between genes. Our technique differs fundamentally from Jenssen et al.'s method in that it is based on term occurrences in indexed abstracts as opposed to gene name co-citation frequencies. This approach allowed us to take advantage of the powerful algorithms used for the analysis of patterns of gene expression. Also, this literature-profiling method should benefit from ongoing efforts to improve visualization tools, clustering techniques, and associated statistics [36,37]. Another major advantage arises from the capacity to include any of the terms present in abstracts, resulting in a considerable increase in the number of potential relationships generated. Finally, the number of genes covered by this type of analysis is also much greater, thanks to the low requirements in the volume of literature associated with each gene.
Text-mining software is also available commercially. Omniviz, one of the most advanced solutions for the analysis of the scientific literature, can group publications (or patents or any other kind of text entries) associated with a common theme (for example, Alzheimer's disease) through the analysis of their content. In contrast, our mining algorithm was specifically designed to group genes (instead of publications) through the analysis of the content of their associated literatures. This conceptual difference makes the techniques distinct from one another. Our approach requires the literature to be indexed for each gene and treated separately throughout the analysis. We also filter terms using stringent criteria, a critical step that allows the analysis of patterns of term occurrences by hierarchical clustering.
Applications and perspectives
This report constitutes a proof of principle on the feasibility and use of literature profiling for high-throughput research. Although room exists for improvements in indexing, filtering and clustering strategies, the methodology described provides a blueprint for the development of computational tools that can rapidly assess literature content to guide the biological interpretation of complex expression data. Because this literature-mining technique analyzes data at a high level it is independent of the platform used by investigators (for example, spotted cDNA or high-density oligonucleotide arrays, protein arrays) and could find applications in both genomics and proteomics research.
In addition to helping to explore large expression datasets, occurrence values displayed for certain terms in the format used in Figure 3 can be used to annotate large and complex lists of genes, providing readers with information on gene function. In our example, giving occurrence values for terms such as 'apoptosis', 'endothelial', 'interferon', 'inflammatory', 'chemoattractant' or 'histocompatibility' provides a 'naive' reader with insight into the function attributed in the literature to each of the listed genes.
Associating literature profiles with gene-expression data could be used for orienting gene discovery. It is believed that co-regulated genes share similar promoters and/or are involved in similar biological processes. Using this principle of 'guilt by association', functions attributed to known genes can be inferred for unknown genes sharing similar expression profiles. In the first example used in this report (see Additional data files), many of the genes were identified using literature profiles as being related to 'interferon', 'virus' and 'infection', and thus it can be assumed that some of the genes and ESTs that were not included in the analysis from lack of literature (see Additional data files) are also associated with these terms. For instance, among the co-regulated ESTs is the 'Homo sapiens cig5 mRNA, partial sequence' (AF026941), which was obtained using differential display analysis to identify sequences for which transcription is induced following cytomegalovirus infection. Another co-regulated but poorly studied gene is 'secreted and transmembrane 1' (U77643), which resembles a cytokine or growth factor in its broad structural characteristics. This gene was later reported to be the ligand for the surface antigen CD7 and found to be capable of activating NK cells, which constitute the primary source of IFN-gamma during early responses to infection. In both examples the link to 'interferon', 'virus' or 'infection' can only be suspected, but certainly deserves attention because these sequences are regulated together with genes known to be involved in the biology of interferons.
The sequencing of whole genomes and the introduction of technologies capable of measuring simultaneously the expression of thousands of genes provides biological research with a global perspective that opposes the trend over the past few decades of the narrowing into highly specialized research fields. But the optimal exploitation of these invaluable resources by researchers necessitates the development of mining tools to explore and interpret data in a time frame compatible with the impressive rate at which they are generated. Individual knowledge is built on associations made between the information we acquire from the literature. The method we describe here mimics this learning process by associating meaningful terms found in scientific publications to create a coherent picture of the relationships that exist within complex groups of genes. Because this analysis is performed independently of knowledge of gene function it provides a means of rapidly probing the biological significance of complex expression data in an unbiased fashion.
Materials and methods
Relevant literature was retrieved for each gene by querying Medline for entries containing gene names, abbreviations or aliases. The URL database used to generate basic PubMed search strings for human genes can be downloaded (see Additional data files). The database is indexed by LocusLink and GenBank IDs. Most search strings must be edited on a gene-by-gene basis, as a vast majority of publications do not adhere to the official nomenclature and gene names and abbreviations in use can differ from the aliases provided by HGNC or lack specificity (see discussion in 'Benefits and limitations'). Acronyms that contain only a few letters are particularly problematic and must often be removed from the query in order to avoid high proportions of false-positive hits.
Data were filtered as described in Results and discussion using Microsoft Excel. The spreadsheet used to filter the gene list analyzed in this report and baseline occurrence values can be downloaded (see Additional data files).
The literature profiles generated for the two large published datasets analyzed in this report can be downloaded (see Additional data files) and explored using the clustergram browser Treeview, which is available online at no charge. The three types of files provided for each example (ATR, GTR, CDT) must be copied into the same folder before opening the CDT file with Treeview.
Results from PubMed queries can be downloaded using the save button on the toolbar after selecting the appropriate output format (the default output format - 'summary' - must be substituted by 'XML'). Abstracts are extracted from the output files and saved in a new file containing abstracts separated by a new line. The text analysis of abstract content was performed using the simstat/wordstat modules (Provalis Research, Montreal). Individual files are merged into a single file by Wordstat's 'document conversion wizard' that can be opened in simstat and analyzed by running the 'content analysis' statistics. The output consists of a table (crosstab - tabulate: word occurrence; display: category percent), which can be saved as a tab-delimited text file.
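The abstract-extraction step can be sketched in a few lines. The example below is a hypothetical alternative to the spreadsheet macro mentioned earlier; it assumes the downloaded XML wraps each record in a `PubmedArticle` element and stores abstract text in `AbstractText` elements, which may not match every export format, and the sample document is invented.

```python
import xml.etree.ElementTree as ET


def extract_abstracts(xml_text):
    """Return one string per record that has an abstract."""
    root = ET.fromstring(xml_text)
    abstracts = []
    for article in root.iter("PubmedArticle"):
        # itertext() also collects text nested inside inline markup.
        parts = ["".join(node.itertext()).strip()
                 for node in article.iter("AbstractText")]
        text = " ".join(part for part in parts if part)
        if text:
            abstracts.append(text)
    return abstracts


# Invented sample with the assumed element names; the second record has no abstract.
sample = """<PubmedArticleSet>
<PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Toy record one</ArticleTitle>
<Abstract><AbstractText>First toy abstract.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Toy record without abstract</ArticleTitle>
</Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

abstracts = extract_abstracts(sample)
# One abstract per line, ready to be saved as a text file.
print("\n".join(abstracts))
```

Records without an abstract are skipped, so the number of lines in the output file is the number of abstracts on which the occurrence percentages are based.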
Clustering analysis was performed using Cluster/Treeview programs available from the Eisen lab website. Genes were grouped using the average linkage hierarchical clustering algorithm.
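For readers unfamiliar with the algorithm, average linkage hierarchical clustering can be sketched in pure Python as below. This is an illustration of the general technique and not a reimplementation of the Cluster program: the use of Euclidean distance, the function name and the toy profiles are assumptions, and the similarity metric applied in the analysis is not specified here.

```python
from itertools import combinations
from math import dist


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []

    def cluster_distance(a, b):
        # Average linkage: mean of all pairwise distances between members.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters at each step.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy profiles (occurrence of 'interferon', 'virus', 'chemokine'), invented for illustration.
toy = {
    "MX1": [0.8, 0.7, 0.0],
    "MX2": [0.7, 0.8, 0.0],
    "MCP2": [0.0, 0.1, 0.9],
    "MCP3": [0.1, 0.0, 0.8],
}
merges = average_linkage(toy)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

In this toy example the two genes with interferon- and virus-related profiles are joined first, the two chemokine-like profiles second, and the two pairs are joined last, which is the kind of hierarchy a dendrogram browser displays.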
Additional data files
Additional tables contain an index of the gene abbreviations used throughout the paper and a detailed list of non-obvious functional relationships identified by the exploration of Figures 6 and 7. Our URL database of indexed PubMed entries (in Microsoft Excel or tab delimited text formats) and a sample term filtering table are available. The literature profiles of Figures 6 and 7 (ATR (Figure 6 and Figure 7), GTR (Figure 6 and Figure 7) and CDT files (Figure 6 and Figure 7)) can be read by the well-known open source dendrogram browser Treeview.
We thank Glynn Dennis, Karl Hoffman, Doug Hosack, Peter Lemkin, Richard Lempicki, James Johndrow and Vishvanath Nene for their critical reading of the manuscript and helpful suggestions.
- Schulze A, Downward J: Navigating gene expression using microarrays - a technology review. Nat Cell Biol. 2001, 3: E190-E195. 10.1038/35087138.
- Schulze A, Downward J: Analysis of gene expression by microarrays: cell biologist's gold mine or minefield? J Cell Sci. 2000, 113: 4151-4156.
- Masys DR, Welsh JB, Lynn Fink J, Gribskov M, Klacansky I, Corbeil J: Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics. 2001, 17: 319-326. 10.1093/bioinformatics/17.4.319.
- Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999, 27: 1210-1214.
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28: 21-28. 10.1038/88213.
- PubMed. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed]
- Human gene nomenclature committee. [http://www.gene.ucl.ac.uk/nomenclature/]
- GenBank. [http://www.ncbi.nlm.nih.gov/Genbank/index.html]
- LocusLink. [http://www.ncbi.nlm.nih.gov/LocusLink/index.html]
- Eisen Lab. [http://rana.lbl.gov/index.htm]
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet. 2001, 2: 418-427. 10.1038/35076576.
- Nau GJ, Richmond JF, Schlesinger A, Jennings EG, Lander ES, Young RA: Human macrophage activation programs induced by bacterial pathogens. Proc Natl Acad Sci USA. 2002, 99: 1503-1508. 10.1073/pnas.022649799.
- Festuccia C, Giunciuglio D, Guerra F, Villanova I, Angelucci A, Manduca P, Teti A, Albini A, Bologna M: Osteoblasts modulate secretion of urokinase-type plasminogen activator (uPA) and matrix metalloproteinase-9 (MMP-9) in human prostate cancer cells promoting migration and matrigel invasion. Oncol Res. 1999, 11: 17-31.
- Foda HD, Zucker S: Matrix metalloproteinases in cancer invasion, metastasis and angiogenesis. Drug Discov Today. 2001, 6: 478-482. 10.1016/S1359-6446(01)01752-4.
- Ferrero E, Vettoretto K, Bondanza A, Villa A, Resnati M, Poggi A, Zocchi MR: uPA/uPAR system is active in immature dendritic cells derived from CD14+CD34+ precursors and is down-regulated upon maturation. J Immunol. 2000, 164: 712-718.
- Vaalamo M, Kariniemi AL, Shapiro SD, Saarialho-Kere U: Enhanced expression of human metalloelastase (MMP-12) in cutaneous granulomas and macrophage migration. J Invest Dermatol. 1999, 112: 499-505. 10.1046/j.1523-1747.1999.00547.x.
- Kato Y, Lewalle JM, Baba Y, Tsukuda M, Sakai N, Baba M, Kobayashi K, Koshika S, Nagashima Y, Frankenne F, et al: Induction of SPARC by VEGF in human vascular endothelial cells. Biochem Biophys Res Commun. 2001, 287: 422-426. 10.1006/bbrc.2001.5622.
- Online Mendelian Inheritance in Man. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM]
- Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, et al: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002, 415: 436-442. 10.1038/415436a.
- Mireskandari A, Reid RL, Kashanchi F, Dittmer J, Li WB, Brady JN: Isolation of a cDNA clone, TRX encoding a human T-cell lymphotrophic virus type-I Tax1 binding protein. Biochim Biophys Acta. 1996, 1306: 9-13. 10.1016/0167-4781(96)00012-7.
- Lu R, Yang P, O'Hare P, Misra V: Luman, a new member of the CREB/ATF family, binds to herpes simplex virus VP16-associated host cellular factor. Mol Cell Biol. 1997, 17: 5117-5126.
- Gatignol A, Kumar A, Rabson A, Jeang KT: Identification of cellular proteins that bind to the human immunodeficiency virus type 1 trans-activation-responsive TAR element RNA. Proc Natl Acad Sci USA. 1989, 86: 7828-7832.
- De Valck D, Jin DY, Heyninck K, Van de Craen M, Contreras R, Fiers W, Jeang KT, Beyaert R: The zinc finger protein A20 interacts with a novel anti-apoptotic protein which is cleaved by specific caspases. Oncogene. 1999, 18: 4182-4190. 10.1038/sj.onc.1202787.
- Jin DY, Wang HL, Zhou Y, Chun AC, Kibler KV, Hou YD, Kung H, Jeang KT: Hepatitis C virus core protein-induced loss of LZIP function correlates with cellular transformation. EMBO J. 2000, 19: 729-740. 10.1093/emboj/19.4.729.
- Benkirane M, Neuveut C, Chun RF, Smith SM, Samuel CE, Gatignol A, Jeang KT: Oncogenic potential of TAR RNA binding protein TRBP and its regulatory interaction with RNA-dependent protein kinase PKR. EMBO J. 1997, 16: 611-624. 10.1093/emboj/16.3.611.
- Holtrich U, Wolf G, Brauninger A, Karn T, Bohme B, Rubsamen-Waigmann H, Strebhardt K: Induction and down-regulation of PLK, a human serine/threonine kinase expressed in proliferating cells and tumors. Proc Natl Acad Sci USA. 1994, 91: 1736-1740.
- Guo SS, Wu X, Shimoide AT, Wong J, Sawicki MP: Anomalous overexpression of p27(Kip1) in sporadic pancreatic endocrine tumors. J Surg Res. 2001, 96: 284-288. 10.1006/jsre.2001.6085.
- Hernandez S, Hernandez L, Bea S, Pinyol M, Nayach I, Bellosillo B, Nadal A, Ferrer A, Fernandez PL, Montserrat E, et al: cdc25a and the splicing variant cdc25b2, but not cdc25B1, -B3 or -C, are over-expressed in aggressive human non-Hodgkin's lymphomas. Int J Cancer. 2000, 89: 148-152. 10.1002/(SICI)1097-0215(20000320)89:2<148::AID-IJC8>3.3.CO;2-I.
- Molthagen M, Schachner M, Bartsch U: Apoptotic cell death of photoreceptor cells in mice deficient for the adhesion molecule on glia (AMOG, the beta 2- subunit of the Na, K-ATPase). J Neurocytol. 1996, 25: 243-255.
- Gloor S, Antonicek H, Sweadner KJ, Pagliusi S, Frank R, Moos M, Schachner M: The adhesion molecule on glia (AMOG) is a homologue of the beta subunit of the Na, K-ATPase. J Cell Biol. 1990, 110: 165-174.
- Katayama Y, House CM, Udagawa N, Kazama JJ, McFarland RJ, Martin TJ, Findlay DM: Casein kinase 2 phosphorylation of recombinant rat osteopontin enhances adhesion of osteoclasts but not osteoblasts. J Cell Physiol. 1998, 176: 179-187. 10.1002/(SICI)1097-4652(199807)176:1<179::AID-JCP19>3.3.CO;2-M.
- Takeshita S, Kikuno R, Tezuka K, Amann E: Osteoblast-specific factor 2: cloning of a putative bone adhesion protein with homology with the insect protein fasciclin I. Biochem J. 1993, 294: 271-278.
- Miyazono K: TGF-beta signaling by Smad proteins. Cytokine Growth Factor Rev. 2000, 11: 15-22. 10.1016/S1359-6101(99)00025-8.
- Yeast Literature. [http://genome-www.stanford.edu/Saccharomyces/literature.html]
- Masys DR: Linking microarray data to the literature. Nat Genet. 2001, 28: 9-10. 10.1038/88324.
- Getz G, Levine E, Domany E: Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci USA. 2000, 97: 12079-12084. 10.1073/pnas.210134797.
- Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci USA. 2001, 98: 8961-8965. 10.1073/pnas.161273698.
- Omniviz. [http://www.omniviz.com/]
- Zhu H, Cong JP, Shenk T: Use of differential display analysis to assess the effect of human cytomegalovirus infection on the accumulation of cellular RNAs: induction of interferon-responsive RNAs. Proc Natl Acad Sci USA. 1997, 94: 13985-13990. 10.1073/pnas.94.25.13985.
- Slentz-Kesler KA, Hale LP, Kaufman RE: Identification and characterization of K12 (SECTM1), a novel human gene that encodes a Golgi-associated protein with transmembrane and secreted isoforms. Genomics. 1998, 47: 327-340. 10.1006/geno.1997.5151.
- Lyman SD, Escobar S, Rousseau AM, Armstrong A, Fanslow WC: Identification of CD7 as a cognate of the human K12 (SECTM1) protein. J Biol Chem. 2000, 275: 3431-3437. 10.1074/jbc.275.5.3431.
- Biron CA, Brossay L: NK cells and NKT cells in innate defense against viral infections. Curr Opin Immunol. 2001, 13: 458-464. 10.1016/S0952-7915(00)00241-7.