A metagenomic study of diet-dependent interaction between gut microbiota and host in infants reveals differences in immune response
© Schwartz et al.; licensee BioMed Central Ltd. 2012
Received: 21 December 2011
Accepted: 30 April 2012
Published: 30 April 2012
Gut microbiota and the host exist in a mutualistic relationship, with the functional composition of the microbiota strongly affecting the health and well-being of the host. Thus, it is important to develop a synthetic approach to study the host transcriptome and the microbiome simultaneously. Early microbial colonization in infants is critically important for directing neonatal intestinal and immune development, and is especially attractive for studying the development of human-commensal interactions. Here we report the results from a simultaneous study of the gut microbiome and host epithelial transcriptome of three-month-old exclusively breast- and formula-fed infants.
Variation in both host mRNA expression and the microbiome phylogenetic and functional profiles was observed between breast- and formula-fed infants. To examine the interdependent relationship between host epithelial cell gene expression and bacterial metagenomic-based profiles, the host transcriptome and functionally profiled microbiome data were subjected to novel multivariate statistical analyses. Gut microbiota metagenome virulence characteristics concurrently varied with immunity-related gene expression in epithelial cells between the formula-fed and the breast-fed infants.
Our data provide insight into the integrated responses of the host transcriptome and microbiome to dietary substrates in the early neonatal period. We demonstrate that differences in diet can affect, via gut colonization, host expression of genes associated with the innate immune system. Furthermore, the methodology presented in this study can be adapted to assess other host-commensal and host-pathogen interactions using genomic and transcriptomic data, providing a synthetic genomics-based picture of host-commensal relationships.
The gut microbiota has profound effects on the health and wellness of the host. For example, studies in germ-free piglets clearly illustrate altered intestinal growth , digestive enzyme activity  and development of the gut-associated lymphoid tissue . Molecular-level studies, enabled by metagenomic, metatranscriptomic and metaproteomic analytical techniques, are reshaping our understanding of how the gut microbiome modulates gastrointestinal morphological, immune development [1–4], gene expression , and the biology of the host in general [6, 7]. Although many studies have shown an effect of diet on the infant microbiota [8–10], little is known of the genome and transcriptome-level cross-talk between the developing infant gut and the colonizing microbiota. At birth, the intestinal tract of the human infant is functionally immature and sterile. Accordingly, the early neonatal period is a critical phase for both intestinal digestive development as well as colonization by the commensal microbiota.
The human intestine is lined by epithelial cells that process nutrients and provide the first line of defense against food antigens and pathogens. Approximately one-sixth of intestinal epithelial cells are shed (exfoliated) daily . This corresponds to the daily exfoliation of 108 to 1010 cells . Because colonization of the intestine with non-pathogenic (commensal) microbiota is vital for neonatal intestinal development [1, 2, 5], it is important to understand how epithelial cells and the microbial ecosystem are modulated by diet. Therefore, our ongoing efforts have been directed at understanding the regulation of neonatal development by components present in human milk. Our initial work isolated exfoliated eukaryotic 'host' cell mRNA from feces, which contain sloughed (intact) intestinal cells, to determine which gene combinations best distinguish the feeding groups. We previously reported that two- and three-gene combinations provide classifiers with potential to non-invasively identify discriminative molecular signatures in the developing human neonate . Specifically, linear discriminant analysis (LDA) was used to identify the best single, two and three-gene combinations for classifying the experimental treatments. LDA is a technique developed for the purpose of statistical pattern recognition . Using a selected list of features, it aims at constructing a discriminating hyperplane that separates the observations from two different classes with a minimum misclassification error. Therefore, gene sets or combinations are identified in response to treatments, as opposed to simply determining up- or down-regulated mRNA expression levels. It is important to emphasize that, previously, our main objective was to identify candidate biomarker genes , and not to probe for interrelationships between the host gut transcriptome and metagenome. In particular, we focused on two major issues: finding groups of genes that discriminate between breast-fed and formula-fed babies, in terms of LDA classification; and identifying potential 'master' regulators as defined by the statistical properties of the non-linear coefficient of determination (CoD). The current manuscript uses a linear model, canonical correlation analysis (CCA), in order to detect interdependencies between the host intestinal transcriptome and the metagenome in healthy full-term infants.
We now present a systematic and statistically rigorous analytical framework for the simultaneous examination of both host and microbial responses to dietary/environmental components in the early neonatal period. Specifically, we tested the hypothesis that the integration of infant (host) epithelial cell transcriptome and functionally profiled microbiome can be used to suggest important regulatory pathways of the microbiome affecting intestinal development in the first few months of life. Initially, we examined the multivariate correlation structures between host intestinal mRNA gene expression levels and functional annotations in genes in the gut metagenome of exclusively breast-fed (BF) and formula-fed (FF) infants at three months of age. Interestingly, we found that the microbiome of BF infants is significantly enriched in genes associated with virulence functionality. Furthermore, we demonstrate a multivariate correlation between the gut flora genes associated with bacterial pathogenicity and the expression of host genes associated with immune and defense mechanisms. In addition, the operational taxonomic unit (OTU) composition and genetic potential of the microbiota differed between BF and FF infants. Our findings suggest that human milk promotes the mutualistic crosstalk between the mucosal immune system and the microbiome in the maintenance of intestinal homeostasis.
Infant growth characteristics
30.0 ± 4.6
30.7 ± 5.9
2.0 ± 0.0
2.1 ± 0.6
5 male/1 female
4 male/2 female
Length at birth (cm)
53.2 ± 3.1
51.0 ± 2.5
Body weight (kg)
3.79 ± 0.50a
3.46 ± 0.20a
At month 1
4.98 ± 0.73b
4.61 ± 0.65b
At month 2
6.43 ± 0.73c
5.66 ± 0.85c
At month 3
7.02 ± 0.72d
6.45 ± 0.96d
BF or FF diet intake (ml/kg/day)
At month 1
166.0 ± 18.3a
162.5 ± 28.4a
At month 2
127.6 ± 19.5a
138.5 ± 14.3a,b
At month 3
129.0 ± 20.1b
134.8 ± 9.5b
Effect of diet on host transcriptional responses
Effect of diet on the gut microbiome
To confirm our findings, we used PhymmBL to taxonomically classify shotgun sequence reads. PhymmBL  is a classification approach for metagenomics data that uses interpolated Markov models (IMMs) and Basic Local Alignment Search Tool (BLAST) to taxonomically classify DNA sequences. The reads were assigned to phyla as summarized in Additional files 4 and 5. While there was general agreement between the 16S-based analysis and the whole shotgun-reads-based analysis, we did identify some inconsistencies. These corresponded to similar discrepancies found in Koenig et al. , and are possibly due to under-representation of 16S rRNA from Actinobacteria. Overall, both analyses are consistent with a previous report indicating a high level of Actinobacteria and Proteobacteria in infants at 3 months of age .
The virulence characteristics of the microbiota were the only functional characteristics that appeared to differ between the BF and FF infants. However, we tested the remaining 36 of 149 non-virulence SEED level 2 categories in which all the FF or BF samples had at least 100 reads comprising a minimum of 0.5% of the total number of reads. All P-values were greater than 0.05, and we did not, therefore, calculate q-values or examine the non-virulence SEED level 2 classifications. For the 84 of 584 SEED level 3 categories (consisting of sub-classifications of 'accessory colonization factor', 'Ton and Tol transport systems', 'type 1 pili (mannose-sensitive fimbriae)', 'the Streptococcus pyogenes Virulom', 'bacterial cyanide production' and 'tolerance mechanisms') for which all the FF or BF samples had at least 50 reads comprising at least 0.1% of the total number of reads, a permutation test was used to examine if the relative abundance of the functional category varied between BF and FF infants. Some P-values were <0.05, but no q-values were <0.10 upon FDR correction and so SEED level 3 classifications were not examined.
Interactions between the gut microbiome and the host transcriptome
For the purpose of uncovering potential symbiotic gut microbial-host metabolic interactions, a variation of CCA was used to examine the multivariate structure between the most promising virulence characteristics of the microbiota (resistance to antibiotics and toxic compounds, Type III, Type IV, ESAT secretion systems, and iron scavenging mechanisms) and host transcriptome data sets. For each gene triple selected from a transcriptome set and analyzed with the metagenomic virulence variables (as described in Materials and methods) there were three canonical correlations. Canonical correlations represent the strongest (ordered) correlations created between linear composites (called canonical variates) of the gene triples with the metagenomic variables (subject to some optimization constraints involving the independence and variation of canonical variates). Hence, they represent the strength of the linear multivariate relationship between the particular host gene triple being analyzed and the microbiome virulence variables [21–23].
Relative gene expression levels in breast-fed (BF) versus formula-fed (FF) infants following a 3-month feeding period
Our ongoing efforts are directed at understanding the regulation of neonatal gastrointestinal development by components present in human milk. The neonatal gastrointestinal tract undergoes pronounced structural and functional changes , which are influenced by diet [25, 26]. For example, there is a stronger trophic response to human milk than formula, suggesting that the bioactive components in human milk are important for gastrointestinal development [27, 28]. Furthermore, the composition of the neonatal microbiota undergoes successional changes, which is profoundly influenced by diet [8–10]. Given the need to better understand neonatal gastrointestinal health and development, we recently demonstrated that host gene set combinations provide discriminative molecular signatures for distinguishing BF versus FF infants . However, no studies to date have attempted to systematically integrate genomic data from both the infant (host mucosa) and gut microbiome. Therefore, the goal of this study was to assess host gene-diet interactions within the context of the structure and operations of gut microbial communities. As part of this effort, we examined multivariate correlation structures between host intestinal mRNA gene signatures and biological processes/metabolic pathways in the gut metagenome of exclusively BF and FF infants at 3 months of age. Microbial composition of the same stool samples was assessed by metagenomic pyrosequencing, thereby providing a picture of the integrated gut/microbial ecosystem. Consistent with previous reports [8–10], the bacterial microbiome phylogenetic profiles strongly characterized the two groups of infants (FF and BF).
There are at least two viable approaches for uncovering the interdependencies between the intestinal transcriptome and the microbiome of the developing infant gut with respect to diet (BF versus FF). The first approach is to evaluate each data set independently on a variable-by-variable basis, and then produce one feature list for each data set in order to explore possible qualitative relationships between the feature lists. For example, Mulder and colleagues  performed traditional metagenomic and differential expression analyses and highlighted plausible relationships between the prominent results from each of the two analyses. The second approach is to analyze the two data sets simultaneously under an analytical framework designed to assess the 'many to many' multivariate relationship between the two variable sets. This provides a quantifiable and objective way to discover and evaluate multivariate relationships between data sets. For example, CCA has been used to evaluate the multivariate relationships between salt-water microbiomes and environmental variables, such as salinity, sample depth, water column depth, temperature and chlorophyll content . We used elements of both approaches to examine potential relationships underlying interactions between the microbiota and the host transcriptome in the developing infant gut. First, an independent analysis of each data set was used to reduce the number of variables under consideration to a computationally tractable number that could be readily assessed by the methods we present. Secondly, based on the previous findings, an analytical multivariate assessment of the microbiome/transcriptome structure was used to inform our qualitative interpretation of the connections between the two.
By initially examining the metagenomic data, we noted that the 'resistance to antibiotics and toxic compounds', 'Type III, Type IV, ESAT secretion systems', 'iron scavenging mechanisms' and 'virulence' characteristics of the microbiome exhibited evidence of differential sensitivity to breast milk as compared to formula. Specifically, while other microbiome characteristics did not provide strong evidence of differentiation between BF and FF infants upon adjustment for multiple testing, virulence-related microbial genes remained strong. Therefore, we focused our transcriptomic analyses on host immunity and defense-associated genes. Additionally, since we were studying the developing human gut, genes known to be involved in intestinal biology were also examined. Our initial differential expression analysis suggested that our prior knowledge gene sets were targeting relevant gene sets.
Next, the metagenomic and transcriptomic data sets were analytically combined in a multivariate analysis that allowed us to assess the strength of the multivariate relationship between the virulence variables of the microbiome and the gut transcriptome genes under examination. Ranking of the best performing genes under consideration indicated that the strongest multivariate relationship with the virulence characteristics were immunity and defense genes. The credibility of this finding was supported by the a priori expectation that the strongest relationships with the virulence characteristics of the microbiome would be the immunity and defense genes, and the comparison to 1,000 random gene sets to which the immunity and defense gene set ranked in the 96.9th percentile with respect to the measure of transcriptome/microbiome multivariate strength. While the intestinal biology showed some strong multivariate relationships to the microbiome virulence characteristics, they were not unusual compared to the random gene sets, and certainly not as strong as those of the immunity and defense genes (Figure 4).
By adapting CCA outcomes, we identified a subset of 11 immunity and mucosal defense-related genes (REL, NDST1, AOC3, VAV2, ALOX5, SP2, BPIL1, DUOX2, KLRF1, IL1A, and TACR1) that exhibited evidence of a multivariate relationship with microbiome virulence characteristics (Figure 6). Although it is premature to assign cause and effect, we conjecture these genes are reacting concordantly in response to microbial conditions. It is interesting to note that genes that modulate gut motility (tachykinin receptor 1, TACR1)  and bacterial-mediated reactive oxygen species signaling/epithelial homeostasis (dual oxidase 2, DUOX2; Vav 2 guanine nucleotide exchange factor, VAV2; v-rel reticuloendotheliosis viral oncogene homologue, REL) [32–36], were up-regulated in BF versus FF infants (Table 2). In contrast, genes that prime mucosal inflammatory responses (killer cell lectin-like receptor subfamily F-member 1, KLRF1; bactericidal/permeability-increasing protein-like 1, BPIL1; arachidonate 5-lipoxygenase, ALOX5; interleukin 1 alpha, IL1A; vascular adhesion protein 1, AOC3) [37–39], were down-regulated in BF versus FF infants. Collectively, these data are consistent with previous findings that breastfeeding facilitates the adaptive, functional changes required for optimal transition from intrauterine to extrauterine life . Our systems-level analyses support previous studies showing that human milk optimally promotes the mutualistic crosstalk been the mucosal immune system and the microbiome in the maintenance of intestinal homeostasis [8, 9, 25, 27, 28].
We have identified a subset of 11 immunity/defense-related genes that exhibited evidence of a multivariate relationship with microbiome virulence and invasiveness characteristics. To our knowledge, this is the first time an assessment of the multivariate relationship between the microbiome and the host transcriptome has been used to identify intestinal genes potentially important in microbiome regulatory pathways and the integrative gut development process. Arguably, by examining the multivariate structure underlying the microbiome and gut transcriptome, our approach leverages richer and fuller information content compared to analyses focusing on single data sets (for example, only host transcriptome data, or only microbiome data) and only single variables (for example, gene by gene differential expression testing). Our study provides a systematic and statistically rigorous analytical framework for the examination of both host and microbial responses to dietary/environmental components in the early neonatal period. Finally, the novel methodology described here for multivariate correlation analysis of host transcriptome and microbiome can be successfully applied to a large variety of host/microbial commensal studies. The use of CCA can support the formulation of hypothesis-based studies by accurately identifying those genes active in commensal microbiome and host activities.
Materials and methods
Ethics statement and subject recruitment
The experimental human protocol was approved by the University of Illinois and Texas A&M University Institutional Review Boards and informed consent was obtained from parents prior to participation in the study. Details of the study admission criteria and protocols have been previously described . Healthy, full-term infants who were exclusively breast-fed or fed commercially available infant formula (Enfamil LIPIL, Mead Johnson Nutrition, Evansville, IN, USA) and medically certified as healthy were eligible for enrollment into the study. For each infant in the study, stool samples were collected at three months after birth.
Isolation of stool DNA
Genomic DNA was extracted using a modification of the method of Yu and Morrison . Deviation from the protocol included the use of glass beater steps. Briefly, 250 mg (wet weight) of feces was weighed into a 2.0 ml tube containing glass matrix E (MP Biomedicals, Solon, OH, USA). Lysis buffer (1 ml; 500 mM NaCl, 50 mM Tris-HCl, 50 mM EDTA, 4% sodium dodecyl sulfate) was added to the tube and shaken for 30 s. Samples were then incubated at 70°C for 15 minutes. After centrifugation at 16,000 g for 5 minutes, supernatants were collected into 2.0 ml tubes. Lysis buffer (300 µl) was subsequently added and the above steps were repeated. Nucleic acids in the supernatant were precipitated sequentially with ammonium acetate and isopropanol, and dissolved in TE buffer. The precipitated nucleic acids were then treated with DNase-free RNase, proteinase K, and further purified on a QIAamp spin column from a QIAgen DNA Mini Stool Kit (Qiagen, Valencia, CA, USA). DNA quality was checked on 1% agarose gels followed by ethidium bromide staining. DNA from three to four extractions per sample was pooled and its concentration quantified using a NanoDrop 1000 spectrophotometer (NanoDrop Technologies, Wilmington DE, USA).
Sequencing of gut microbiomes
DNA from fecal samples was submitted to the high throughput sequencing and genotyping unit at the Keck Center for Comparative and Functional Genomics, University of Illinois. Samples were sequenced using a 454 Life Sciences Genome Sequencer FLX with GS FLX Titanium series reagents (Roche, Nutly, NJ, USA). Briefly, DNA was fractionated (fragments of 500 to 800 bp) and polished. Subsequently, barcodes containing adaptors A and B were ligated to the ends according to Roche's instructions. Adaptor B contained a 5'-biotin tag to immobilize the DNA library on streptavidin beads. After nick repair, the non-biotinylated strand was released and used as a single-stranded template DNA (sstDNA) library. Library quantity was assessed using Qubit reagents (Invitrogen, CA, USA) and pooled to equal molarity. The optimal level of DNA for emulsion PCR was determined by titration. Beads were loaded onto a PicoTiterPlate device for shotgun sequencing. Signal processing was performed using Roche software.
Host gut mRNA transcriptome analysis
From each subject, poly A+ RNA was isolated from feces as previously described [12, 41]. Due to the high level of bacterial RNA in fecal samples, poly A+ RNA was isolated in order to obtain a highly enriched mammalian RNA population . In addition, an Agilent 2100 Bioanalyzer was used to assess integrity of exfoliated cell poly A+ RNA and quantification was performed on a NanoDrop Spectrophotometer. Samples were processed in strict accordance to the CodeLink™ Gene Expression Assay manual (Applied Microarray, Tempe, AZ, USA) and analyzed using the Human Whole Genome Expression Bioarray as we have previously described .
The microarray data have been previously processed and analyzed . Technical errors in the probes were relatively rare, with approximately 2.5% of the probes being flagged. Nonetheless, thorough quality control processing resulted in 16,767 probes available for analysis. The log base 2 transformed expression data were normalized using two methods, standard loess normalization  as shown in Additional file 2, and a weighted median adjustment method . Subsequently, and based on our findings in the microbiome sequence data, two data sets were constructed using curated gene lists based on literature reviews, functional gene assignments from PANTHER biological processes  and DAVID [44, 45]. Using this prior knowledge, discrete sets of biomarkers (genes) known to be involved in intestinal biology (459) and immunity and defense (660) (see Additional files 8 and 9) [12, 13] were generated. Focusing and targeting the scope of the data under consideration in a biologically meaningful way (i) reduces the dimension of the data being analyzed and protects against extensive multiple testing, (ii) allows for exhaustive examination of all small feature subsets (all three-gene sets) and thereby avoids feature selection, which is known to be highly unreliable in small sample settings , and (iii) allows for computational tractability and analysis feasibility. GO enrichment analyses were performed using the GO Fat gene ontology functional annotation tool , available on DAVID [44, 45]. The expression values for enriched gene subsets were assessed using a permutation test and corrected for multiple testing discovery rate correction (FDR) .
Metagenomic data analysis
Shotgun 454 read data were preprocessed in the following manner: (i) low quality reads were removed if the read mean Phred value was <20 and/or when two or more consecutive nucleotides exhibited a Phred value <20; (ii) reads were clustered using CD-HIT-454  at 100% removing duplicates; (iii) the remaining shotgun sequences were analyzed using the MG-RASTv2 pipeline and the phylogenetic distribution and metabolic functional composition of the samples were profiled . Representative MG-RAST sample statistics are shown in Additional file 5. Taxonomic classifications were assessed in two ways. First, identifiable 16S fragments in the shotgun sequences were used to align to the GreenGenes small subunit rRNA database . Second, PhymmBL  was used as an additional metagenomic phylogenetic classification tool. PhymmBL uses BLAST and interpolated Markov models to taxonomically classify DNA sequences, including reads as short as 100 bp. In addition, to filter out possible human contamination from the reads, BLAST was used to compare all reads to the human genome (Genome Reference Consortium assembly, version 37, 2009 ). Any full length reads that were 100% identical to the human reference genome were discarded. Reads that were 100% identical, but whose length was under 80 amino acids or did not share a full-length alignment with the human reference sequence were not discarded. Between 0 and 13,222 reads were discarded from each sample. The percentage of discarded reads did not exceed 4.3% (13,222 discarded from sample 6) and in all other samples was lower than 0.4%.
Microbiota functional characteristics in BF and FF infants were compared. Additional file 10 provides a breakdown of the average number of reads across samples that were mapped to functional SEED categories. Because of the hierarchical structure of the SEED classification system, aggregating reads into coarser classifications provided for a more informed analysis. For SEED level 2 classifications, approximately 25% of the functional classifications exhibited an average number of 200 reads across samples. At SEED level 3, approximately 6% of the functional classifications had an average number of 200 reads. Subsequently, comparisons between functional categories were carried out subject to the following restrictions: SEED level 1 functional categories were compared if all the FF or BF samples had at least 200 reads from each sample and at least 2% of the overall number of reads of each sample; SEED level 2 functional categories were compared if all the FF or BF samples had at least 100 reads from each sample and at least 0.5% of the overall number of reads of each sample; SEED level 3 functional categories were compared if all the FF or BF samples had at least 50 reads from each sample and at least 0.1% of the overall number of reads of each sample.
Gut metagenome and host transcriptome data integration
In order to take into account multivariate structure when assessing and ranking genes, we analytically quantified the multivariate relationships between the metagenomic and transcriptomic data. CCA was used to uncover the multivariate structure between the metagenome and host transcriptome data sets, which is discussed in more detail in Additional file 11. CCA is a multivariate analysis method and provides measures of the strength - that is, canonical correlations - of multivariate association between variable sets as well as a means to interpret the role of the variables in terms of the underlying multivariate relationship [21–23]. The implicitly linear relationship embedded in CCA is targeted at the simplest first-order relationships that might be detectable between two data sets. Certainly, non-linear relationships are possible, and will not be detected by CCA methodology. However, such relationships are extremely difficult to estimate without large sample sizes, which are difficult to obtain in clinical settings involving infants, and thus we did not attempt to capture them given the small sample size of our current cohort. Since the CCA method is based on an estimate of the covariance matrix between the two variable sets, it is unreliable when the number of variables is large relative to the number of samples being used to estimate the covariance structure. Because of the limited number of subjects (six per treatment group), it was not possible to exhaustively examine all the microbiome and transcriptome outcomes simultaneously. Therefore, we repeatedly applied CCA to all subsets (of size three) of host gene expression variables combined with the metagenomic data (three variables). We refer to the subsets (of size three) of the gene expression variables as gene triples. By analyzing all gene triples in turn with the virulence characteristics, we examined the multivariate structure between the gut metagenome and host transcriptome in a piecewise, sub-dimensional manner. CCA results using either the loess normalization method or the weighted median adjustment normalization method were very similar (data not shown). Thus, only loess normalized data are presented.
As a result of our preliminary analysis of the metagenomic data, we targeted the SEED level 2 virulence characteristics for integration with the presumed relevant host gut gene expression data (immunological and defense genes as well as intestinal biology genes as described in the 'Host gut mRNA transcriptome analysis' section above). Four ('resistance to antibiotics and toxic compounds', 'Type III, Type IV, ESAT secretion systems', 'iron scavenging mechanisms', and 'virulence') out of thirteen ('invasion and intracellular resistance', 'prophage, transposon', 'toxins and superantigens', 'Type VI secretion systems', 'detection', 'pathogenicity islands', 'regulation of virulence', 'adhesion', and 'posttranslational modification' in addition to the preceding four categories) SEED level 2 virulence categories had more than ten sequence reads for each sample. We subsequently discarded the catch all 'virulence' category and used 'resistance to antibiotics and toxic compounds', 'Type III, Type IV, ESAT secretion systems', and 'iron scavenging mechanisms' as our so called 'virulence characteristics'. Each sample had more than 30 sequence reads representing each category and more than 50 sequence reads for all samples in either BF or FF groups. The average number of sequence reads was 245 over all categories and infants. Read count proportions were ultimately used in the CCA analysis. For integration with the virulence variables, we used the 100 of 660 immunological and defense genes and the 100 of 459 intestinal biology genes that had the smallest P-values. This was done to avoid a computationally prohibitive combinatorial explosion in the number of gene triples to be analyzed. The overall result of our approach was a list of 'best' host genes (out of those considered), that is, those showing the strongest empirical evidence of a relationship with the gut metagenome as judged by multivariate association and structure.
With regard to mathematical modeling, there is some similarity between CCA and principal components analysis (PCA). PCA is frequently used in high dimensional settings to uncover structure in the data, perhaps in conjunction with clustering methodologies, and to generally reduce data dimensionality. While there are slight differences in the mathematical optimization specification of PCA and CCA, they perform highly related analyses. The primary advantage of CCA in the present setting, however, is that it is specifically designed to uncover the multivariate structure between two distinct data sets. PCA makes no such prior distinction between data sets and thus does not specifically target the multivariate structure between two distinct data sets. We initially explored the use of PCA, but found CCA more adequately suited to the primary task of data integration. Additional file 12 shows the amount of variation explained by the first and second principal components for each gene triple/metagenome set examined. Sets characterized by only a few principal components would be expected to be potential candidates for strong performance under CCA; however, since the principal components in PCA do not necessarily target the underlying relationship between gene triples and the metagenome, they may instead identify factors influencing only gene triples or only the metagenome.
The raw metagenome sequence data minus human-identical sequences are available at the European Bioinformatics Institute's Short Read Archive (study accession number: ERP001038). The human microarray data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus  and are accessible through GEO Series accession number GSE31075.
canonical correlation analysis
false discovery rate
linear discriminant analysis
principal components analysis.
This work was supported by National Institute of Health grants CA129444, HD61929, R25T-CA090301, Hatch project ILLU-971-346 through the Division of Nutritional Sciences Vision 20/20 program and USDA-NIFA Grant Designing Foods for Health, 2010-34402-20875. IF is supported by startup funding from the College of Arts and Science at Miami University.
- Willing BP, Van Kessel AG: Enterocyte proliferation and apoptosis in the caudal small intestine is influenced by the composition of colonizing commensal bacteria in the neonatal gnotobiotic pig. J Anim Sci. 2007, 85: 3256-3266. 10.2527/jas.2007-0320.PubMedView ArticleGoogle Scholar
- Willing BP, Van Kessel AG: Intestinal microbiota differentially affect brush border enzyme activity and gene expression in the neonatal gnotobiotic pig. J Anim Physiol Anim Nutr (Berl). 2009, 93: 586-595. 10.1111/j.1439-0396.2008.00841.x.View ArticleGoogle Scholar
- Meurens F, Berri M, Siggers RH, Willing BP, Salmon H, Van Kessel AG, Gerdts V: Commensal bacteria and expression of two major intestinal chemokines, TECK/CCL25 and MEC/CCL28, and their receptors. PLoS One. 2007, 2: e677-10.1371/journal.pone.0000677.PubMedPubMed CentralView ArticleGoogle Scholar
- Van den Abbeele P, Van de Wiele T, Verstraete W, Possemiers S: The host selects mucosal and luminal associations of coevolved gut microorganisms: a novel concept. FEMS Microbiol Rev. 2011, 35: 681-704. 10.1111/j.1574-6976.2011.00270.x.PubMedView ArticleGoogle Scholar
- Chowdhury SR, King DE, Willing BP, Band MR, Beever JE, Lane AB, Loor JJ, Marini JC, Rund LA, Schook LB, Van Kessel AG, Gaskins HR: Transcriptome profiling of the small intestinal epithelium in germfree versus conventional piglets. BMC Genomics. 2007, 8: 215-10.1186/1471-2164-8-215.PubMedPubMed CentralView ArticleGoogle Scholar
- Phelan VV, Liu WT, Pogliano K, Dorrestein PC: Microbial metabolic exchange-the chemotype-to-phenotype link. Nat Chem Biol. 2012, 8: 26-35.View ArticleGoogle Scholar
- Dumas ME: The microbial-mammalian metabolic axis: beyond simple metabolism. Cell Metab. 2011, 13: 489-490. 10.1016/j.cmet.2011.04.005.PubMedView ArticleGoogle Scholar
- Poroyko V, White JR, Wang M, Donovan S, Alverdy J, Liu DC, Morowitz MJ: Gut microbial gene expression in mother-fed and formula-fed piglets. PLoS One. 2010, 5: e12459-10.1371/journal.pone.0012459.PubMedPubMed CentralView ArticleGoogle Scholar
- Adlerberth I, Wold AE: Establishment of the gut microbiota in Western infants. Acta Paediatr. 2009, 98: 229-238. 10.1111/j.1651-2227.2008.01060.x.PubMedView ArticleGoogle Scholar
- Marques TM, Wall R, Ross RP, Fitzgerald GF, Ryan CA, Stanton C: Programming infant gut microbiota: influence of dietary and environmental factors. Curr Opin Biotechnol. 2010, 21: 149-156. 10.1016/j.copbio.2010.03.020.PubMedView ArticleGoogle Scholar
- Chapkin RS, McMurray DN, Lupton JR: Colon cancer, fatty acids and anti-inflammatory compounds. Curr Opin Gastroenterol. 2007, 23: 48-54. 10.1097/MOG.0b013e32801145d7.PubMedView ArticleGoogle Scholar
- Chapkin RS, Zhao C, Ivanov I, Davidson LA, Goldsby JS, Lupton JR, Mathai RA, Monaco MH, Rai D, Russell WM, Donovan SM, Dougherty ER: Noninvasive stool-based detection of infant gastrointestinal development using gene expression profiles from exfoliated epithelial cells. Am J Physiol Gastrointest Liver Physiol. 2010, 298: G582-589. 10.1152/ajpgi.00004.2010.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao C, Ivanov I, Dougherty ER, Hartman TJ, Lanza E, Bobe G, Colburn NH, Lupton JR, Davidson LA, Chapkin RS: Noninvasive detection of candidate molecular biomarkers in subjects with a history of insulin resistance and colorectal adenomas. Cancer Prev Res (Phila). 2009, 2: 590-597. 10.1158/1940-6207.CAPR-08-0233.View ArticleGoogle Scholar
- Donovan SM, Monaco MH, Drnevich JM, Hernell O, Kvistgaard AS, Lonnerdal B: Transcriptional responses of the neonatal rhesus intestine to osteopontin. J Pediatr Gastroenterol Nutr. 2011, 52: E62-Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995, 57: 289-300.Google Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009, 6: 673-676. 10.1038/nmeth.1358.PubMedPubMed CentralView ArticleGoogle Scholar
- Koenig JE, Spor A, Scalfone N, Fricker AD, Stombaugh J, Knight R, Angenent LT, Ley RE: Succession of microbial consortia in the developing infant gut microbiome. Proc Natl Acad Sci USA. 2011, 108 (Suppl 1): 4578-4585.PubMedPubMed CentralView ArticleGoogle Scholar
- Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008, 9: 75-10.1186/1471-2164-9-75.PubMedPubMed CentralView ArticleGoogle Scholar
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005, 33: 5691-5702. 10.1093/nar/gki866.PubMedPubMed CentralView ArticleGoogle Scholar
- Cooley WW, Lohnes PR: Multivariate Data Analysis. 1971, New York: WileyGoogle Scholar
- Dunteman GH: Introduction to Multivariate Analysis. 1984, Beverly Hills: Sage PublicationsGoogle Scholar
- Krzanowski WJ: Principles of Multivariate Analysis: a User's Perspective. 2000, New York: Oxford University Press, RevisedGoogle Scholar
- de Santa Barbara P, van den Brink GR, Roberts DJ: Development and differentiation of the intestinal epithelium. Cell Mol Life Sci. 2003, 60: 1322-1332. 10.1007/s00018-003-2289-3.PubMedPubMed CentralView ArticleGoogle Scholar
- Palmer AC: Nutritionally mediated programming of the developing immune system. Adv Nutr. 2011, 2: 377-395.PubMedPubMed CentralView ArticleGoogle Scholar
- Cummins AG, Thompson FM: Postnatal changes in mucosal immune response: a physiological perspective of breast feeding and weaning. Immunol Cell Biol. 1997, 75: 419-429. 10.1038/icb.1997.67.PubMedView ArticleGoogle Scholar
- Donovan SM: Role of human milk components in gastrointestinal development: Current knowledge and future needs. J Pediatr. 2006, 149: S49-S61. 10.1016/j.jpeds.2006.06.052.View ArticleGoogle Scholar
- Donovan SM, Odle J: Growth factors in milk as mediators of infant development. Annu Rev Nutr. 1994, 14: 147-167. 10.1146/annurev.nu.14.070194.001051.PubMedView ArticleGoogle Scholar
- Mulder IE, Schmidt B, Stokes CR, Lewis M, Bailey M, Aminov RI, Prosser JI, Gill BP, Pluske JR, Mayer CD, Musk CC, Kelly D: Environmentally-acquired bacteria influence microbial diversity and natural innate immune responses at gut surfaces. BMC Biol. 2009, 7: 79-10.1186/1741-7007-7-79.PubMedPubMed CentralView ArticleGoogle Scholar
- Gianoulis TA, Raes J, Patel PV, Bjornson R, Korbel JO, Letunic I, Yamada T, Paccanaro A, Jensen LJ, Snyder M, Bork P, Gerstein MB: Quantifying environmental adaptation of metabolic pathways in metagenomics. Proc Natl Acad Sci USA. 2009, 106: 1374-1379. 10.1073/pnas.0808022106.PubMedPubMed CentralView ArticleGoogle Scholar
- Sanger GJ: Neurokinin NK1 and NK3 receptors as targets for drugs to treat gastrointestinal motility disorders and pain. Br J Pharmacol. 2004, 141: 1303-1312. 10.1038/sj.bjp.0705742.PubMedPubMed CentralView ArticleGoogle Scholar
- Geiszt M, Witta J, Baffi J, Lekstrom K, Leto TL: Dual oxidases represent novel hydrogen peroxide sources supporting mucosal surface host defense. FASEB J. 2003, 17: 1502-1504.PubMedGoogle Scholar
- Lipinski S, Till A, Sina C, Arlt A, Grasberger H, Schreiber S, Rosenstiel P: DUOX2-derived reactive oxygen species are effectors of NOD2-mediated antibacterial responses. J Cell Sci. 2009, 122: 3522-3530. 10.1242/jcs.050690.PubMedView ArticleGoogle Scholar
- Schmitter T, Pils S, Sakk V, Frank R, Fischer KD, Hauck CR: The granulocyte receptor carcinoembryonic antigen-related cell adhesion molecule 3 (CEACAM3) directly associates with Vav to promote phagocytosis of human pathogens. J Immunol. 2007, 178: 3797-3805.PubMedView ArticleGoogle Scholar
- Steinbrecher KA, Harmel-Laws E, Sitcheran R, Baldwin AS: Loss of epithelial RelA results in deregulated intestinal proliferative/apoptotic homeostasis and susceptibility to inflammation. J Immunol. 2008, 180: 2588-2599.PubMedView ArticleGoogle Scholar
- Swanson PA, Kumar A, Samarin S, Vijay-Kumar M, Kundu K, Murthy N, Hansen J, Nusrat A, Neish AS: Enteric commensal bacteria potentiate epithelial restitution via reactive oxygen species-mediated inactivation of focal adhesion kinase phosphatases. Proc Natl Acad Sci USA. 2011, 108: 8803-8808. 10.1073/pnas.1010042108.PubMedPubMed CentralView ArticleGoogle Scholar
- Salmi M, Jalkanen S: VAP-1: an adhesin and an enzyme. Trends Immunol. 2001, 22: 211-216. 10.1016/S1471-4906(01)01870-1.PubMedView ArticleGoogle Scholar
- Kuttruff S, Koch S, Kelp A, Pawelec G, Rammensee HG, Steinle A: NKp80 defines and stimulates a reactive subset of CD8 T cells. Blood. 2009, 113: 358-369.PubMedView ArticleGoogle Scholar
- Nagashima T, Ichimiya S, Kikuchi T, Saito Y, Matsumiya H, Ara S, Koshiba S, Zhang J, Hatate C, Tonooka A, Kubo T, Ye RC, Hirose B, Shirasaki H, Izumi T, Takami T, Himi T, Sato N: Arachidonate 5-lipoxygenase establishes adaptive humoral immunity by controlling primary B cells and their cognate T-cell help. Am J Pathol. 2011, 178: 222-232. 10.1016/j.ajpath.2010.11.033.PubMedPubMed CentralView ArticleGoogle Scholar
- Yu Z, Morrison M: Improved extraction of PCR-quality community DNA from digesta and fecal samples. Biotechniques. 2004, 36: 808-812.PubMedGoogle Scholar
- Davidson LA, Lupton JR, Miskovsky E, Fields AP, Chapkin RS: Quantification of human intestinal gene expression profiles using exfoliated colonocytes: a pilot study. Biomarkers. 2003, 8: 51-61. 10.1080/1354750021000042268.PubMedView ArticleGoogle Scholar
- Smyth GK, Speed T: Normalization of cDNA microarray data. Methods. 2003, 31: 265-273. 10.1016/S1046-2023(03)00155-5.PubMedView ArticleGoogle Scholar
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003, 13: 2129-2141. 10.1101/gr.772403.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang da W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13. 10.1093/nar/gkn923.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57.PubMedView ArticleGoogle Scholar
- Sima C, Dougherty ER: What should be expected from feature selection in small-sample settings. Bioinformatics. 2006, 22: 2430-2436. 10.1093/bioinformatics/btl407.PubMedView ArticleGoogle Scholar
- Niu B, Fu L, Sun S, Li W: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics. 2010, 11: 187-10.1186/1471-2105-11-187.PubMedPubMed CentralView ArticleGoogle Scholar
- Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008, 9: 386-10.1186/1471-2105-9-386.PubMedPubMed CentralView ArticleGoogle Scholar
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006, 72: 5069-5072. 10.1128/AEM.03006-05.PubMedPubMed CentralView ArticleGoogle Scholar
- Genome Reference Constortium. [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/]
- Hotelling H: Relations between two sets of variates. Biometrika. 1936, 28: 321-377.View ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.