Comprehensive analysis of RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

High throughput RNA sequencing technologies have provided invaluable research opportunities across distinct scientific domains by producing quantitative readouts of the transcriptional activity of both entire cellular populations and single cells. The majority of RNA-Seq analyses begin by mapping each experimentally produced sequence (i.e., read) to a set of annotated reference sequences for the organism of interest. For both biological and technical reasons, a significant fraction of reads remains unmapped. In this work, we develop Read Origin Protocol (ROP) to discover the source of all reads originating from complex RNA molecules, recombinant T and B cell receptors, and microbial communities. We applied ROP to 8,641 samples across 630 individuals from 54 tissues. A fraction of RNA-Seq data (n=86) was obtained in-house; the remaining data was obtained from the Genotype-Tissue Expression (GTEx v6) project. To generalize the reported number of accounted reads, we also performed ROP analysis on thousands of different, randomly selected, and publicly available RNA-Seq samples in the Sequence Read Archive (SRA). Our approach can account for 99.9% of 1 trillion reads of various read length across the merged dataset (n=10641). Using in-house RNA-Seq data, we show that immune profiles of asthmatic individuals are significantly different from the profiles of control individuals, with decreased average per sample T and B cell receptor diversity. We also show that immune diversity is inversely correlated with microbial load. Our results demonstrate the potential of ROP to exploit unmapped reads in order to better understand the functional mechanisms underlying connections between the immune system, microbiome, human gene expression, and disease etiology. ROP is freely available at https://github.com/smangul1/rop and currently supports human and mouse RNA-Seq reads.


INTRODUCTION 74
Advances in RNA sequencing (RNA-seq) technology have provided an unprecedented 75 opportunity to explore gene expression across individual, tissues, and environments 76 Microbial reads can be introduced by contamination or natural microbiome content in 154 the sample, such as viral, bacterial, fungi, or other microbial species (Salter et al., 2014).  Strauli & Hernandez, 2015). However, we find that performing 162 a sequential analysis, in the order described above, is critical for minimizing 163 misclassification of reads due to homologous sequences between the different classes. and repetitive (green colors) categories. Unmapped reads that fail to map are extracted 176 and further filtered to exclude low quality reads, low complexity reads, and reads from 177 rRNA repeats (brown color). ROP protocol is able to identify unmapped reads aligned to 178 human references with use of a more sensitive alignment tool (lost human reads: red 179 color), unmapped reads aligned to human references with excessive ('hyper') editing 180 To test ROP, we applied it to one trillion RNA-Seq reads across 54 tissues from 2630 190 individuals. The data was combined from 3 studies: (1) in-house RNA-Seq data (n=86) 191 from the peripheral blood, nasal, and large airway epithelium of asthmatic and control 192 individuals (S1); (2) multi-tissue RNA-Seq data from Genotype-Tissue Expression (GTEx v6) 193 from 53 human body sites (Consortium & others, 2015) (n=8555) (S2); (3) randomly 194 selected RNA-Seq samples from the Sequence Read Archive (SRA) (n=2000) (S3). Unless 195 otherwise noted, we reported percentage of reads averaged across 3 datasets.
RNA-Seq data obtained from the three sources represent a large collection of 197 tissue types and read diversity. We selected these three sources to most accurately model 198 the precision and broad applicability of ROP. The in-house RNA-Seq data was collected 199 from 53 asthmatics and 33 controls. RNA-seq libraries were prepared from total RNA with 200 two types of RNA enrichment methods: (1) Poly(A) enrichment libraries, applied to RNA 201 from peripheral blood and nasal epithelium (n=38), and (2) ribo-depletion libraries, 202 applied to RNA from large airway epithelium (n=49). The GTEx dataset was derived from 203 38 solid organ tissues, 11 brain subregions, whole blood, and three cell lines across 544 204 individuals. Randomly selected SRA RNA-Seq samples included samples from whole 205 blood, brain, various cell lines, muscle, and placenta. Length of reads from in-house data 206 was 100bp, read length in Gtex data was 76bp, read length in SRA data ranged from 36bp 207 to 100bp. In total, 1 trillion reads (97 Tbp) derived from 10641 samples were available for 208 Table S1 and Supplementary Methods). For counting purposes, the 209 pairing information of the reads is disregarded, and each read from a pair is counted 210 separately. 211

212
We used standard read mapping procedures to obtain mapped and unmapped 213 reads from all three data sources. Read mapping for GTEx data was performed by the 214 GTEx consortium using TopHat2 (Kim et al., 2013). Following the GTEx consortium 215 practice, we used TopHat2 to map reads from in-house and SRA studies. High-throughput 216 mapping using TopHat2 (Kim et al., 2013) recovered 83.1% of all reads from three studies 217 ( Fig. 2.a), with the smallest fraction of reads mapped in the SRA study (79% mapped reads). From the unmapped reads, we first excluded low-quality/low-complexity reads 219 and reads mapping to the rRNA repeating unit, which together accounted for 7.0% and 220 2.4% of all reads, respectively ( Fig. 2.b). We were then able to align unmapped reads to 221 human reference sequences (5.7% of all reads, Fig. 2.c) and identify "hyper-edited" reads 222    Table S4). We observe that an alternative 261 aligner and a more sensitive mapping setting has no substantial effect on the number of 262 mapped reads (Supplementary Table S5). This is in line with Baruzzo et al., 2016, which 263 have shown that optimizing the parameters of RNA-Seq aligner is a non-trivial task and 264 methods with good performance for the default setting is a preferred choice. We have identified hyper-edited reads by using the pipeline proposed in (Porath,295 Carmi, & Levanon, 2014). This hyper-editing pipeline transforms all As into Gs, in both 296 the unmapped reads and the reference genome, and the pipeline realigns the 297 transformed RNA-Seq reads and the transformed reference genome. The method then 298 recovers original sequences and searches for dense clusters of A-to-G mismatches. 299 A total of 201,676,069 hyper-edited reads were identified across all samples 300 from the three studies. As a control for the detection, we calculated the prevalence of all 301 6 possible nucleotide substitutions and found that 79.9% (201,676,069/252,376,867) of 302 the detected reads were A-to-G mismatches (Supplemental Fig. S8). In comparison, the 303 in-house RNA-Seq samples have a 96.1% rate of A-to-G mismatches. This massive over-304 representation of mismatches strongly suggests that these reads resulted from ADAR 305 mediated RNA editing. However, additional experiments are required to confirm the 306 nature of these edits. In addition, we found that the nucleotide sequence context of the 307 detected editing sites complies with the typical sequence motif of ADAR targets 308 (Supplemental Fig. S9). 309 310

The ROP protocol complements transcriptome profiling by non-co-linear RNAs 311
The ROP protocol is able to detect 'non-co-linear' reads via Tophat-Fusion ( supported by more than one read. Over 90% of non-co-linear events were supported by 318 fewer than 10 samples (Supplemental Fig. S10). We used a liberal threshold, based on number of reads and individuals, because our interest is mapping all reads. However, a 320 more stringent cut off is recommended for confident identification of non-co-linear 321 events, specially in the clinical settings. 322 Based on the in-house RNA-Seq data, we observe that the library preparation 323 technique strongly affects the capture rate of non-co-linear transcripts. To compare the 324 number of NCL events, we sub-sampled unmapped reads to 4,985,914 for each sample, 325 which corresponded to the sample with the smallest number of unmapped reads among 326 in-house RNA-Seq samples. We observed an average increase of 92% of circRNAs in 327 samples prepared by ribo-depletion compared to poly(A) protocol (p-value = 3 x 10 -12 ) 328 (Supplemental Fig. S11). At the same time, we observed an average 43% decrease of 329 trans-splicing and fusion events in samples prepared by ribo-depletion compared to 330 poly(A) protocol (p-value < 8 x 10 -4 ) (Supplemental Fig. S11). However, because the 331 tissues differed between protocols (e.g., nasal versus large airway epithelium), this effect 332 might be due in part to tissue differences in NCL events. We view the tissue differences 333 effect to be unlikely. We previously showed that gene expression profiles of nasal airway 334 tissue largely recapitulate expression profiles in the large airway epithelium tissue (Poole 335 et al., 2014). 336 Furthermore, many NCL events will not be captured by poly-A selection. 337 Therefore, we expect systematic differences in NCL abundance between capture 338 methods. There were no statistically significant differences (p-value > 5x10 -3 ) between 339 NCL events in cases and controls. We have compared number of NCL reads across GTEx 340 tissue, and we observe the highest fraction of NCL reads across pancreas samples with 0.75% of reads classified as NCL reads. In all other tissue types, ROP classified 342 approximately .3% reads as NCL reads (Supplemental Figure S12). proportions into a single diversity metric. We observed a mean alpha diversity of .7 among 380 all the samples for immunoglobulin kappa chain (IGK). Spleen, minor salivary gland, and 381 small intestine (terminal ileum) were the most immune diverse tissue, with corresponding 382 IGK alpha diversity of 86.9, 52.05, and 43.96, respectively (Supplemental Fig. S14-S15). (Pearson coefficient r = -0.55, p-value = 2.4 x 10 -6 ), consistent also for bacteria and 407 eukaryotic pathogens across BCR and TCR loci (Supplemental Fig. S17). 408 409 Using in-house data, we compared alpha diversity of asthmatic individuals (n = 9) 410 and healthy controls (n = 10). The combinatorial profiles of B and T cell receptors in blood 411 and large airway tissue provide no differentiation between case control statuses. Among 412 nasal samples, we observed decreased alpha diversity for asthmatic individuals relative 413 to healthy controls (p-value = 10 -3 ) (Fig. 2.b). Additionally, we used beta diversity 414 (Sørensen-Dice index) to measure compositional similarities between samples, including 415 gain or loss of VJ combinations of IGK locus. We observed higher beta diversity 416 corresponding to a lower level of similarity across the nasal samples of asthmatic 417 individuals in comparison to samples from unaffected controls (Fig. 2.c, p-value < 3.7x10 -418 13 ). Moreover, nasal samples of unaffected controls are significantly more similar than 419 samples from the asthmatic individuals (Fig. 2.c, p-value < 2.5x10 -9 ). Recombination 420 profiles of immunoglobulin lambda locus (IGL) and T cell receptor beta and gamma (TCRB 421 and TCRG) loci yielded a similar pattern of decreased beta diversity across nasal samples 422 of asthmatic individuals (Supplemental Fig. S18-S20). Together the results demonstrate 423 the ability of ROP to interrogate additional features of the immune system without the 424 expense of additional TCR/BCR sequencing. Our study is the first that systematically accounts for almost all reads, totaling one trillion, 447 available via three RNA-seq datasets. We demonstrate the value of analyzing unmapped 448 reads present in the RNA-seq data to study the non-co-linear, RNA editing, 449 immunological, and microbiome profiles of a tissue. We developed a new tool (ROP) that 450 accounts for 99.9% of the reads, a substantial increase compared over the 82.2% of reads 451 account for using conventional protocols. We found that the majority of unmapped reads 452 are human in origin and from diverse sources, including repetitive elements, A-to-I RNA 453 editing, circular RNAs, gene fusions, trans-splicing, and recombined B and T cell receptor 454 sequences. In addition to those derived from human RNA, many reads were microbial in 455 origin and often occurred in numbers sufficiently large to study the taxonomic 456 composition of microbial communities in the tissue type represented by the sample. 457

458
We found that both unmapped human reads and reads with microbial origins are 459 useful for differentiating between type of tissue and status of disease. For example, we 460 found that the immune profiles of asthmatic individuals have decreased immune diversity when compared to those of controls. Further, we used our method to show that immune 462 diversity is inversely correlated with microbial load. This case study highlights the 463 potential for producing novel discoveries, when the information in RNA-seq data is fully 464 leveraged by incorporating the analysis of unmapped reads, without need for additional 465 TCR/BCR or microbiome sequencing. The ROP profile of unmapped reads output by our 466 method is not limited to RNA-Seq technology and may apply to whole-exome and whole-467 genome sequencing. We anticipate that ROP profiling will have broad future applications 468 in studies involving different tissue and disease types. We first converted the unmapped reads saved by TopHat2 from a BAM file into a FASTQ 547 file (using samtools with bam2fq option). The FASTQ file of unmapped reads contains full read pairs (both ends of a read pair were unmapped) and discordant read pairs (one read 549 end was mapped while the other end was unmapped). We disregarded the pairing 550 information of the unmapped reads and categorized unmapped reads using the 551 protocol's seven steps. Reads identified at each step are filtered out. 552 553 A. Quality Control. Low quality reads, defined as reads that have quality lower than 30 in 554 at least 75% of their base pairs, were identified by FASTX (v 0.0.13). Low complexity reads, 555 defined as reads with sequences of consecutive repetitive nucleotides, were identified by 556 SEQCLEAN. As a part of the quality control, we also excluded unmapped reads mapped 557 Tophat-Fusion (v2.0.13, bowtie1 v0.12.) and allows simultaneous monitoring of NCL 580 events in the same run. To extract trans-spicing and gene fusion events from the TopHat-581 Fusion output, we ran a ruby custom script, which is part of the ROP pipeline (NCL.rb).

Reference databases 599
A detailed description of reference databases used by ROP is provided in Supplemental 600

Comparing diversity across groups 602
First, we sub-sampled unmapped reads by only including reads corresponding to a sample 603 with the smallest number of unmapped reads. Diversity within a sample was assessed 604 using the richness and alpha diversity indices. Richness was defined as a total number of 605 distinct events in a sample. We used Shannon Index (SI), incorporating richness and 606 evenness components, to compute alpha diversity, which is calculated as follows: 607 SI = − × log -608 We used beta diversity (Sørensen-Dice index) to measure compositional similarities 609 between the samples in terms of gain or loss in events. We calculated the beta diversity 610 for each combination of the samples, and we produced a matrix of all pairwise sample 611 dissimilarities. The Sørensen-Dice beta diversity index is measured as 1 − We investigated the impact of alternative aligners (STAR, 626 https://github.com/alexdobin/STAR) and carefully adjusted the mapping setting to 627 achieve sensitive and very sensitive settings (Supplemental Table S4). The average 628 runtime on Hoffman2 Cluster for Tophat per million reads was 2.5 hours; STAR, 0.13 629 hours; and Novoalign, 9.1 hours. Novoalign was not considered in the analysis due to its 630 substantially longer running time, which made it infeasible for the protocol.