Approaches for integrating heterogeneous RNA-seq data reveals cross-talk between microbes and genes in asthmatic patients 2

Sputum induction is a non-invasive method to evaluate the airway environment, particularly for asthma. RNA sequencing (RNAseq) can be used on sputum, but it can be challenging to interpret because sputum contains a complex and heterogeneous mixture of human cells and exogenous (microbial) material. In this study, we developed a methodology that integrates dimensionality reduction and statistical modeling to grapple with the 22 heterogeneity. We use this to relate bulk RNAseq data from 115 asthmatic patients with clinical information, microscope images, and single-cell profiles. First, we mapped sputum RNAseq to human and exogenous sources. Next, we decomposed the human reads into cell-expression signatures and fractions of these in each 25 sample; we validated the decomposition using targeted single-cell RNAseq and microscopy. We observed enrichment of immune-system cells (neutrophils, eosinophils, and mast cells) in severe asthmatics. Second, 27 we inferred microbial abundances from the exogenous reads and then associated these with clinical variables - 28 - e.g., Haemophilu s was associated with increased white blood cell count and Candid a, with worse lung 29 function. Third, we applied a generative model, Latent Dirichlet allocation (LDA), to identify patterns of gene 30 expression and microbial abundances and relate them to clinical data. Based on this, we developed a method called LDA-link that connects microbes to genes using reduced-dimensionality LDA topics. We found a number 32 of known connections, e.g. between Haemophilus and the gene IL1B, which is highly expressed by mast cells. In addition, we identified novel connections, including Candida and the calcium-signaling gene CACNA1E, which is highly expressed by eosinophils. These results speak to the mechanism by which gene-microbe 35 interactions contribute to asthma and define a strategy for making inferences in heterogeneous and noisy RNAseq datasets.


ABSTRACT (337 words) 18
Sputum induction is a non-invasive method to evaluate the airway environment, particularly for asthma. RNA 19 sequencing (RNAseq) can be used on sputum, but it can be challenging to interpret because sputum contains 20 a complex and heterogeneous mixture of human cells and exogenous (microbial) material. In this study, we 21 developed a methodology that integrates dimensionality reduction and statistical modeling to grapple with the 22 heterogeneity. We use this to relate bulk RNAseq data from 115 asthmatic patients with clinical information, 23 microscope images, and single-cell profiles. First, we mapped sputum RNAseq to human and exogenous 24 sources. Next, we decomposed the human reads into cell-expression signatures and fractions of these in each 25 sample; we validated the decomposition using targeted single-cell RNAseq and microscopy. We observed 26 enrichment of immune-system cells (neutrophils, eosinophils, and mast cells) in severe asthmatics. Second, 27 we inferred microbial abundances from the exogenous reads and then associated these with clinical variables -28 -e.g., Haemophilus was associated with increased white blood cell count and Candida, with worse lung 29 function. Third, we applied a generative model, Latent Dirichlet allocation (LDA), to identify patterns of gene 30 expression and microbial abundances and relate them to clinical data. Based on this, we developed a method 31 called LDA-link that connects microbes to genes using reduced-dimensionality LDA topics. We found a number 32 of known connections, e.g. between Haemophilus and the gene IL1B, which is highly expressed by mast cells. 33 In addition, we identified novel connections, including Candida and the calcium-signaling gene CACNA1E, 34 which is highly expressed by eosinophils. These results speak to the mechanism by which gene-microbe 35 interactions contribute to asthma and define a strategy for making inferences in heterogeneous and noisy 36 RNAseq datasets.

INTRODUCTION 38
Linking high-dimensional, heterogeneous datasets 39 RNA sequencing (RNAseq) has become a standard method of analyzing complex communities. Depending on 40 the sample type, these data can be very heterogeneous. A key problem tackled in this paper is dealing with the 41 heterogeneity and noise in RNAseq data in complex samples such as sputum. This can be appreciated by 42 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint comparing sputum RNAseq to a more traditional experiment, e.g. blood RNAseq, where the sample can be 43 collected consistently and that contains relatively well-defined cell types (Figure 1). In blood, the vast majority 44 of RNAseq reads align to the human genome, and the goal is often to relate the expression of the genes to a 45 phenotype. By contrast, sputum may be less consistently collected, its cell types are less defined, and it may 46 contain RNA from microbes and other organisms that act as cryptic indicators of the environment. This 47 combination of variables and dimensions often requires researchers to collapse the dimensions to 48 appropriately de-noise the analysis. Here, we present such a strategy that uses a number of supervised and 49 unsupervised techniques such as single-cell signatures and latent Dirichlet allocation (LDA). These techniques 50 can produce a low-dimensional representation of common groups of genes, microbes, or other features that 51 tend to increase or decrease in abundance together. Our approach is useful when the heterogeneity comes 52 from the sample type (e.g., sputum) and especially when the samples derive from a heterogeneous population 53 of individuals, such as patients with asthma. 54

Interactions between the host and microbes in the lung 55
Asthma is a disease of the airway that can present with many clinical phenotypes. Much work has focused on 56 identifying subgroups of the disease and how each subgroup responds to treatment. For example, Yan et al. 57 introduced transcriptional endotypes of asthma and the Severe Asthma Respiratory Phenotype consortium 58 defined five subtypes of asthma [1]. Some of these subgroups respond differently to environmental and 59 microbial triggers, such as fungal spores. Some fungi have well-defined effects in asthma, but the role of many 60 microbes remains contentious. A simplified model assigns microbes to one of three categories: pathogenic 61 organisms that cause inflammation, beneficial organisms that reduce inflammation, and those that have no 62 effect on inflammation. The majority of the organisms in the lungs are expected to have no effect, and severe 63 asthmatics are expected to have more pathogenic and fewer beneficial microbes. 64

Inferring immune cell fractions from RNAseq data 65
The pathology of microbes is often inferred by the number and type of immune cells observed in samples, such 66 as sputum total leukocyte counts [2,3]. A standard method for counting immune cells in sputum samples uses 67 microscopy, but the resolution is limited to a few cell types [4]. Other cell-counting methods such as flow-68 sorting can be challenging because of the viscosity and highly variable cell numbers in sputum. An alternative 69 strategy uses cell-type specific expression patterns to deconvolve RNAseq reads from mixtures of cells into 70 fractions of different immune cells [5]. This deconvolution also effectively de-noises heterogeneous datasets by 71 greatly reducing the number of dimensions. Importantly, the RNA needed for this analysis can be purified 72 without poly-A enrichment-here, we use human ribosomal RNA knockdownwhich allows for the 73 simultaneous analysis of microbial and human transcripts. 74

Supervised deconvolution and the microbiome 75
While deconvolution to cell fractions effectively de-noises human RNAseq data, an equivalent method does not 76 exist for microbes. Although we can map microbe reads onto their genomes, this approach is imperfect 77 because the genome databases are incomplete and assigning a read to a single genome can be complicated if 78 it matches more than one equally well. One can reduce the dimensions by collapsing microbial strains to 79 different taxonomic ranks (e.g., genus or family); however, taxonomy is notoriously imprecise at defining 80 behavior. For example, many bacteria in the genus Escherichia are human commensals, whereas Escherichia 81 coli OH157:H7 causes hemorrhagic colitis. Alternatively, one can group sequences by the metabolic pathways 82 observed, although this requires high-depth sequencing. Here, we propose a method to reduce the 83 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint dimensionality of microbes by first linking the microbes to human genes, and then applying the relatively well-84 defined gene dimensionality-reduction methods (e.g., deconvolution to cell types). 85 86 In this paper, we use RNAseq of sputum samples from asthmatic patients to demonstrate dimensionality-87 reduction strategies and identify microbe-host relationships. We map RNAseq reads onto human or microbial 88 genomes and relate the resulting abundance matrices to each other and to clinical data. Further, we 89 deconvolve the human reads into fractions of the various cell types that make up sputum. Finally, we relate the 90 human genes and microbes using a method we call LDA-link, which identifies relationships between genes, 91 microbes, and cell types. These methods represent a general strategy for dealing with heterogeneous RNAseq 92 data that is applicable to other sample types beyond sputum. 93 94

RESULTS 95
Sequencing and processing with the extracellular RNA processing toolkit (exceRpt) pipeline 96 We collected induced sputum samples from 115 patients with heterogeneous asthma phenotypes and 97 sequenced these sample using RNAseq. The median read depth per sample was 47.5 million, which meets 98 depth recommendations for analyses of this type [6]. We processed these reads through the exceRpt pipeline 99 [7], which conservatively matches reads to genomes in a sequential order designed to reduce experimental 00 artifacts. In brief, we first aligned the quality filtered reads to the UniVec database of common laboratory 01 contaminants 2 , and then aligned the remaining reads to human ribosomal sequences before aligning them to 02 the human genome. We excluded samples with a low ratio of transcript alignments to intergenic sequence 03 alignments, and then aligned the remaining reads to the comparably large sequence space of non-human 04 genomes. We first aligned reads to the relatively well-curated ribosomal databases of bacteria, fungi, and 05 archaea (e.g., Ribosomal Database Project 3 ) and then to curated genomes of bacteria, fungi, viruses, plants, 06 and animals. The percent of reads mapping to different biotypes was highly heterogeneous; a median of 60% 07 of the reads aligned to the human reference genome and 50% to annotated transcripts ( Figure 1, green bars). 08 A median of 0.7% of the input reads aligned to exogenous sources, with some samples containing as much as 09 28.1% exogenous reads. As a control, we applied the same protocol to blood samples, which demonstrated 10 more homogeneity than sputum (Figure 1, top, "blood"). 11 Overview of the analysis approach 12 The goal of the analysis was to infer meaningful relationships between the numbers and origins of the RNAseq 13 reads and relate them to clinical phenotypes. We conceptualized the clinical information and RNAseq 14 alignments as a series of tables (Figure 1) . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint Individual correlations can be difficult to interpret, particularly in heterogeneous, sparse, or noisy datasets. 26 Organizing the genes into relevant pathways or cell types can reduce the dimensionality and de-noise the 27 analysis. To this end, we deconvolved G ( × ) into a cell-type fraction table, F ( × ) , and a cell-type 28 signatures table, S ( × ) . However, an analogous supervised method does not exist for the microbes. 29 Therefore, we applied an unsupervised dimensionality-reduction approach, latent dirichlet allocation (LDA), 30 which provides a topic distributions in patients (θ G , × ) across a smaller number ( =10) of topics and 31 gene topic (and φ G , × ). This can also be done to the microbe table M and get θ M and φ M , and the gene 32 and microbe topic can be correlated (e.g. R( •, , •, ) over all patients). 33 34 The framework described above is useful for identifying linear relationships, but non-linear relationships are 35 also possible. For example, a microbe sensed by a human immune cell could lead to the activation of a 36 transcription factor and the expression of several genes, each of which would have a non-linear relationship to 37 microbe abundance. To identify such relationships, we applied a non-linear ensemble learning algorithm [8,9], 38 using the de-noised inputs for each gene and microbe (φ G and φ M ). We call this method LDA-link. Further, we 39 relate the gene and microbe links identified to cell fractions and thereby relate how the host is responding to 40 microbes with regards to immune cell type response with a particular gene. 41

Analysis of human-aligned reads 42
Working toward the hypothesis that we can conceptualize human-aligned sputum RNAseq reads as a mixture 43 of immune cell types, each with a distinct expression profile, we deconvolved the Gene This method relies on knowing the signature gene-set in each cell type, which derived from the blood immune 46 cell high quality profiles. To validate that we could apply these cell expression profiles to sputum, we generated 47 several additional datasets including single-cell RNAseq (scRNAseq), microscopy, and unsupervised 48 decomposition, and then compared the results to the deconvolution table F. (Figure 2A, schema). 49

Evaluation of deconvolution results by scRNAseq 50
First, we performed scRNAseq on a cohort of similar sputum samples (five control and five asthmatic patients). 51 The single-cell sequences clustered into four groups ( Figure 2B, first and second panels). To determine 52 whether the reference profiles that we used to deconvolve the bulk RNAseq recapitulate those found in the 53 single-cell clusters, we co-clustered the reference profiles with the scRNAseq data ( Figure 2B, third panel). 54 The reference profiles split into the groups by lineage; for example, those in the lymphoid progenitor line co-55 clustered with cluster 2, and the myeloblast progenitor line co-clustered with cluster 4. This result suggests that 56 the reference profiles accurately represent the cell types in sputum. The myeloid lineage cluster showed a 57 significant difference in the number of cells between asthmatics and controls ( Figure 2C). From this analysis, 58 we concluded that (1) the blood-derived cell profiles appropriately fit the sputum cell types and (2) no additional 59 cell types are needed to deconvolve the sputum bulk RNAseq data. 60

Evaluation of deconvolution results by microscopy 61
Second, we evaluated a subset of the samples by microscopy and manually counted the number of 62 neutrophils, eosinophils, lymphocytes, and macrophages. We found good agreement with F , when cell counts 63 could be directly compared, i.e. neutrophils and eosinophils were both present in F and counted by 64 microscopy. In cases where the deconvolution method gave higher resolution, (e.g., M0, M1, and M2 65 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint macrophages versus one type of macrophage by microscopy), the aggregation of the relevant columns in Ff 66 correlated well with the microscopy counts ( Figure 2D). 67 68

Association of cell fractions with clinical features 69
Having validated the deconvolution of sputum samples (table F), we then correlated the cell fractions with 70 clinical features (R( •, , •, ) for all patients). We found that the changes in fractions of several cell types were 71 highly correlated with clinical features ( Figure 2E). For example, the fraction of T-regulatory cells negatively 72 correlated with the number of hospitalizations per year, suggesting a beneficial role of these cells in the 73 management of asthma. 74 75

Evaluation of deconvolution results by unsupervised decomposition 76
We compared the signal captured by cell-type deconvolution to an unsupervised decomposition method: LDA. 77 Using LDA, we factored the gene expression table into ten topics that conceptually represent gene expression 78 programs. This resulted in a gene-topic-fraction-in-patients found agreement between LDA and the cell-signature-based deconvolution for only the most prominent cell 82 type, neutrophils ( Figure 2D, topic 4). The top genes associated with topic 4 were enriched in the neutrophil 83 chemotaxis pathway (Figure S8 B). 84 85 However, the remaining topics were comprised of multiple cell types. This suggests that LDA can identify 86 distinct but partially overlapping features in G. According to the clustering of θ G , a subgroup of severely 87 asthmatic patients was highly correlated with topic four ( Figure S8A). The top-weighted genes in topic 4 were 88 enriched for the pathways "neutrophil chemotaxis" and "asthma-related genes" (Figure S8B). These pathways 89 were not enriched in the analogous cell-type-signatures table S, suggesting that LDA topics are distinct from 90 the cell-type signatures, but are also clinically relevant. Moreover, the top-weighted genes in topic 1 of the 91 gene topic components table were mitochondrial genes, and topic 1 was strongly correlated with age. This link 92 shows strong support in the literature, as reactive oxygen species produced by the mitochondria reduce their 93 function over time [10]; however, we did not observe this relationship for any cells in the cell-type-fractions 94 table (F). Another method using a very different algorithm than LDA, non-negative matrix factorization (NMF), 95 showed strong agreement with LDA ( Figure S2, Nmf.1). This supports the use of supervised deconvolution 96 methods as picking out interpretable signals that are different than those identified by unsupervised methods. 97 Unsupervised decomposition should be considered a set of features distinct from those found through 98 deconvolution. 99 00

Analysis of exogenous reads 01
After filtering out contaminants and human reads, we assembled the set of reads that aligned to exogenous 02 genomes into a Microbe table (M). The exogenous sequences aligned to mostly bacteria and fungi, although 03 we also observed a few arthropod and helminth reads (Supplemental Table X). The dominant phyla observed 04 were from the bacterial kingdom: Proteobacteria, Firmicutes, and then Bacteroidetes. The abundance of 05 Proteobacteria is in contrast to observations from the gut where Bacterioidetes predominate [11]. Also notable 06 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint was the presence of two phyla of fungi among the eight most abundant overall, although this was in lower 07 abundance than many of the bacterial phyla. 08

Microbes correlations with clinical information and cell fractions 09
We correlated the microbe abundances to clinical information (R( •, , •, ) for all patients) ( Figure 3A). 10 Haemophilus was associated with increased total white blood cell numbers, as has been described previously 11 [12]. Candida was associated with worse lung function test results (e.g., forced expiratory volume and forced 12 vital capacity), which supports the association with a severe form of asthma characterized by eosinophilia [13]. 13 14 We next correlated microbe abundances to human immune cell fractions (R( •, , •, ) for all patients) ( Figure  15 3B). Several correlations demonstrated results with strong literature precedence. For example, studies have 16 previously shown that Haemophilus associates with eosinophilia [14], and we observed a significant correlation 17 between Haemophilus and the fraction of eosinophils. We also observed a significant correlation between 18 Haemophilus and activated mast cells, suggesting an alternative route to Haemophilus-induced inflammation 19 [15]. Moreover, the fungal genus Candida was also significantly correlated with eosinophils, even more 20 strongly than Haemophilus. Pulmonary candidiasis has long been associated with allergic bronchial asthma 21 and inflammation [16], however few lung microbiome studies have examined both bacterial and fungal signals. 22 This highlights the need for a more comprehensive search of the lung microbiome and demonstrates the power 23 of an RNAseq-based method that can report on all kingdoms with the same sample preparation. 24

Dimensionality reduction for microbes: clustering and networks 25
We attempted to de-noise the microbe table ( ) with a variety of dimensionality-reduction techniques. 26 First, we collapsed the microbes by taxonomy, grouping them to the rank of phylum, and then hierarchically 27 cluster the patients based on their phylum abundance ( Figure 3C HierClust( )). The hierarchical 28 clustering showed that the phylum distributions formed three clusters of patients. We related these clusters to 29 the clinical variable "asthma severity" and observed that cluster 2 was enriched for patients identified as having 30 moderate or severe asthma. This cluster was characterized by the highest relative abundance of the phylum 31 Proteobacteria ( Figure 3C). Notably, the genus Haemophilus belongs to this phylum, consistent with the 32 correlations observed at the genus rank ( Figures 3A, 3B). 33 34 Similarly, we could de-noise the microbe table using a co-abundance network, by correlating the genus-level 35 abundances (R( •, , •, ) and identifying significant modules (Supplemental Figure Z). An interpretation of 36 these modules is that they define metabolic niches, where microbes either directly compete for metabolites or 37 there is interdependency in metabolite production. Such networks could be created from other tables, such as 38 the topic distribution of microbes (R( •, , •, ) for all the topics) ( Figure 3D). These modules represent 39 another unit that could be related to the clinical information (C) and the cell-type fractions (F). 40

LDA-link for the identification of links between genes and microbes 41
How much cross-talk exists between microbes and human cells in the airway remains contentious [17]. We feel 42 this is partly due to the heterogeneous and noisy data from airway samples, where it is often difficult to find 43 strong correlations using standard algorithms. We therefore sought to link genes to microbes via a new method 44 called LDA-link. 45 46 LDA-link connects genes to microbes using a combination of linear correlation, unsupervised decomposition 47 and an ensemble learning classifier. We hypothesized that the only strongest gene and microbe correlations 48 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint would be observable through the noise in the RNAseq data. Therefore, we used these strong links as a training 49 set to find other links, after taking steps to reduce the noise in the data. We reduced the noise using LDA and 50 then identified links using a random forest classifier, described in more detail below and in the methods 51 section. 52 53 To define the training, set we first related columns between the gene and microbe tables (R( •, , •, )), 54 yielding many low-scoring correlations. However, a relatively small number were strong (R > 0.4) and highly 55 significant (p < 1E-5 after FDR correction) ( Figure 4A). We selected the very strong correlations as true-56 positive links between genes and microbes in the training set, and non-correlated pairs (-0.05 < R < 0.05) as 57 true-negative links. The genes involved in these strong correlations were enriched for pathways related to 58 microbial interactions in the airway, including "Asthma & Bronchial Hypersensitivity" and "Respiratory Syncytial 59 Virus Bronchiolitis" (Figure 4B), suggesting that the small set of strong linear correlations were relevant to 60 asthma. 61 62 Next, we trained a random forest classifier on the linear correlations described above. To reduce the noise in 63 the data, the features used as inputs to the classifier were the LDA topics for each gene and microbe 64 ( •, , •, ). That is, for each gene-microbe pair, we concatenated the gene and microbe topics into a single 65 vector (length 20). The Gini index showed the most important features in defining links between genes and 66 microbes were gene topics #7 and #8, and microbe topic #1 (Figure 4 C-F). The genes that comprise the most 67 influential gene topic #8, are enriched for the pathway "Inflammatory Response", and specifically the cytokines 68 IL2 and IL6. It is tempting to speculate that these genes are strong predictors of a link between genes and 69 microbes because they indicate when the presence of a microbe has triggered an inflammatory response. 70

Cross-talk between genes and microbes defined by LDA-link 71
LDA-link identified connections between genes and microbes reported elsewhere in the literature as well as 72 novel observations. A bipartite graph summarizes a subset of the connections, showing in most cases several 73 genes linked to each microbe ( Figure 5A, for a complete list see Supplemental Table X). Notably, both fungi 74 and bacteria showed these links, further highlighting the need to evaluate more than bacteria when performing 75 microbiome experiments in the airway. The gene lactotransferrin was linked to Aeromonas, which has been 76 associated with gastroenteritis and skin infections and has been previously reported to bind lactoferrin [18]. 77 Burkholderia, a gram-negative bacterial genus, is recognized as an important pathogen in the mucus-filled 78 lungs of patients with cystic fibrosis; it was linked to gene MUC6, which encodes a secreted protein 79 responsible for the production of mucin [19]. Haemophilus was observed to be linked to NFKB Inhibitor Zeta, 80 which is induced by the bacterial cell wall component lipopolysaccharide [20]. In addition, Haemophilus was 81 linked to the cytokine interleukin 1 beta (IL1B), an important mediator of the inflammatory response. IL1B 82 hypersensitivity is a hallmark of the asthma phenotype. Pasteurella was also linked to IL1B, and its toxin has 83 been shown to induce expression of IL1B [21]. In addition to single gene-microbe pairs, we layered on pathway 84 and cell deconvolution data to identify larger-scale effects of microbes. 85 86 Microbes were linked to genes that are enriched in pathways relating to auto-immunity and inflammation as 87 well as cytokine receptors and their interactions ( Figure 5B). The microbes associated with cytokine pathways 88 included Synechococcus, Lactococcus,Dialister,Psychrobacter,Moraxella,Brenneria,Proteus,Haemophilus,89 and Pasteurella. In addition, we related the cell-type signatures table (Sf,g) to identify the immune cell types that 90 are related to each microbe ( Figure 5C). We observed the Haemophilus-IL1B linkage in monocytes and mast 91 cells. Samples containing Haemophilus triggered more activated mast cells according to its cell fraction 92 (Figure 5C inset) [22][23][24][25]. Similarly, the fungal genus Candida was linked to the gene GCSAML, which was 93 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint highly expressed by eosinophils. The presence of Candida was associated with increased numbers of 94 Eosinophils in the airway. 95 96

DISCUSSION 97
Heterogeneity and noise are common problems in biological datasets. Heterogeneity can derive from mixtures 98 of different cell types, such as in sputum, or from sparsity, such as in microbiome or single-cell RNAseq data. 99 Unsupervised methods of dimensionality reduction can effectively eliminate these issues, but suffer from 00 decreased interpretability. That is, variables are collapsed together for reasons that are often opaque. 01 Supervised dimensionality reduction maintains interpretability because variables are collapsed using prior 02 knowledge, such as the genes in a pathway or the expression patterns of a cell type. Here, we combined 03 unsupervised and supervised approaches to de-noise the data while retaining interpretability. 04 05 The field is increasingly appreciating the role of the airway microbiome in the development of disease. 06 Commensal microbiota have been shown in other contexts to be strong regulators of host immune system 07 development and homeostasis [26]. Disturbances in the composition of commensal bacteria can result in 08 imbalanced immune responses and affect an individual's susceptibility to various diseases, including those that 09 are inflammatory (e.g., inflammatory bowel disease and colon cancer), autoimmune (e.g., celiac disease and 10 arthritis), allergic (e.g., asthma and atopy), and metabolic (e.g., diabetes, obesity, and metabolic syndrome) 11 (reviewed in [27]). Investigating the microbiota in the lower respiratory tract is a relatively new field in 12 comparison to the extensive work on the intestinal tract. In fact, the lung was excluded from the original Human 13 Microbiome Project because it was not thought to have a stable resident microbiome [11]. A limited number of 14 reports have investigated the changes in the lung microbiota between healthy, non-smoking and smoking 15 individuals as well as in patients suffering from cystic fibrosis, chronic obstructive pulmonary disease, or 16 asthma [2, 28-30]. Despite emerging data on the airway microbiota, little is known about the role of the lung 17 microbiome in modulating pulmonary mucosal immune responses. LDA-link can find relationships between 18 microbes and genes and link them to immune cells and their responses. 19 20 The linkages identified here suggest major processes by which lung immune cells respond to microbes. We 21 found that mast cells respond to Haemophilus and Pasteurella via IL1B and that eosinophils respond to 22 Candida via GCSAML. While experimental validation of these linkages is needed, these results represent 23 observations that would be missed by analyses that do not deconvolve RNAseq data into cell fractions, or that 24 analyze only human RNAseq reads. We expect LDA-link to be broadly useful in relating heterogeneous or 25 noisy RNAseq data. 26

Sample collection and sequencing 28
Sputum induction was performed with hypertonic saline, the mucus plugs were dissected away from the saliva, 29 the cellular fraction was separated, and the RNA was purified as described previously [1]. Briefly, RNA was 30 purified using the All-in-One purification kit (Norgen Biotek) and its integrity was assayed by an Agilent 31 bioanalyzer (Agilent Technologies, Santa Clara, CA). Ribosomal depletion was performed with the RiboGone-32 Mammalian kit (Clontech Cat. Nos. 634846 & 634847 ) and cDNA was created with the SMARTer Stranded 33 RNAseq Kit (Cat. Nos. 634836). Samples were sequenced using an Illumina HiSeq 4000 with 2x125 bp reads, 34 with an average of 47.5 million reads per sample. 35 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint

RNAseq processing by exceRpt 36
An adapted version of the software package exceRpt [7] was used to process the sputum RNAseq data. 37 Briefly, RNAseq reads were subjected to quality assessment using FastQC software v.0.10.1 38 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) both prior to and following 3' adapter clipping. 39 Adapters were removed using FastX v.0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/). Identical reads were 40 counted and collapsed to a single entry and reads containing N's were removed. Clipped, collapsed reads 41 were mapped directly to the human reference genome (hg19) and pre-miRNA sequences using STAR [31]. 42 Reads that did not align were mapped against a ribosomal reference library of bacteria, fungi, and archaea, 43 compiled by the Ribosome Database Project [32], and then to genomes of bacteria, fungi, plants, and viruses, 44 retreived from GenBank [32]. In cases where RNAseq reads aligned equally well to more than one microbe, a 45 "last common ancestor" approach was used, and the read was assigned to the next node up the phylogenetic 46 tree, as performed by similar algorithms [7,33]. 47 48

Data tables notation 49
We use the following notation to define matrices associated with p patients (115)   Given each patient ( ), all of the genes and microbes were treated like corpus of words in the traditional LDA 86 application. The word ( ) was gene or microbe, and the word count was gene expression or microbe 87 abundances. We built LDA models for genes and microbes, respectively .  88  89  390  391   392  393  94 Given , , , , , N , , , α, β, Z, θ, φ, W, where , , , denote a patient, a word in a document, a topic 95 and a word in the corpus respectively; is the number of documents(patients ), N is the number of words 96 (gene or microbe) in a document, N is the number of topics (set as 10), is the corpus for all the documents; 97 α ( dimensional vector)and β ( -dimensional vector) are the hyper parameters for θ ( × 98 , the distributoin of topics in documents ) and φ ( × , the distribution of word for topics) W is an -99 dimensional vector that denotes the word (gene or microbe expression) in a document (patients). Z is the -00 dimensional vector of integers between 1 and for the topic of word in a document. 01 02 The joint distribution of the LDA model is ( , ; , ) and φ and θ are integrated out as: 03 04 Gibbs sampling equation can be derived from ( , ; , ) to approximate the distribution of ( | ; α, β) 10 because ( ; , ) is invariant to . Given , denotes the topic of the th word token in the th document, 11 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint and also assume that its word symbol is the th word in the vocabulary, the conditional probability can be inferred 12 as follows: After sampling, the expectation of the θ (doc → topic) and φ(topic → word) matrix can be inferred as follows 22 given the symmetric hyper-parameters and were used: 23 24 We instantiated the variables θ and φ to θ , , θ , , and , , , , where θ , , θ , denotes the gene and 28 microbe topic fraction in patient ; , , , denotes the gene and microbe topic. 29

Single-cell RNAseq 30
Sputum cells were separated on a Fluidigm C1 medium-sized channel. The mRNA was purified from 31 approximately 500pg-1ng of total RNA using the Clontech SMARTer Ultra Low RNA Kit and poly-dA-selected 32 using SPRI beads and dT primers. Full-length cDNA was sheared into 200-500bp DNA fragments by 33 sonication ( not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint Gene-microbe correlations with -values less than 1e-5 (absolute correlation greater than 0.4) were chosen as 44 the positive links in a training set. Negative links in the training set were defined as an absolute correlation of 45 less than 0.05. This approach resulted in 302 positive and 650,398 negative links. A random forest algorithm 46 was trained on this set, which can accommodate the highly unbalanced dataset as well as potentially identify 47 non-linear links between genes and microbes. Down-sampling and up-sampling techniques were tested but did 48 not significantly improve the model. In the final model, we adopted the upscaling technique and tested it using 49 cross-validation. The positive dataset was upscaled to very high levels. We use 2-fold cross validation to 50 validate the performance. Simply, we randomly select half training data to train the model, and use the 51 remaining records to test the performance and repeat this for ten times. The AUC and AUPR were 0.994 and 52 0.996 on average, respectively. 53 54 Microbe co-abundance network 55 The raw abundance and LDA microbe topic matrices , which represent the microbe's weight to each 56 topic, were generated. 57 58 The correlation network between different microbes was calculated using Pearson correlation. The cutoff to 59 define a co-abundance edge was 0.8 for R( •, , •, ) and 0.3 for R(, •, . The microbe network modules, which 60 were densely connected themselves but sparsely connected to other modules, were clustered based on 61 between-ness [35] and the other algorithms; we also tested label propagation and fast greedy algorithms [36]. 62 63 We also compared the LDA topics with the microbes in the same clusters. If a microbe was the top 10 most 64 highly contributed for a topic, then we labeled the topic number in the bracket. Some microbes may have 65 multiple topic labels because they highly contribute to multiple topics. 66

DATA ACCESS 67
Sputum bulk-cell RNAseq data can be found under the bioproject SRRXXXXX and sputum single-cell RNAseq 68 data at SRRYYYYYY. 69

ACKNOWLEDGMENTS 70
This work was supported by a National Library of Medicine fellowship to DS (5T15LM007056-28) and an 71 NHLBI grant to GC (1R01HL118346-01). The authors would like to thank the support of the Yale High 72 Performance Computing services (Grace, Ruddle, Farnam) and Yale Center for Genome Analysis.  The heatmap between NMF component and cell fraction from LM22. 14 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint

23.
Baldwin AS, Jr. The NF-kappa B and I kappa B proteins: new discoveries and insights. Annu Rev 10 Immunol. 1996;14:649-83. Epub 199614:649-83. Epub /01/01. doi: 10.1146            . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint B C D A Subset of the gene-microbe linkages defined by the LDA-link model Figure 5 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 11, 2019. ; https://doi.org/10.1101/765297 doi: bioRxiv preprint