IMP: a pipeline for reproducible integrated metagenomic and metatranscriptomic analyses

We present IMP, an automated pipeline for reproducible integrated analyses of coupled metagenomic and metatranscriptomic data. IMP incorporates preprocessing, iterative co-assembly of metagenomic and metatranscriptomic data, analyses of microbial community structure and function as well as genomic signature-based visualizations. Complementary use of metagenomic and metatranscriptomic data improves assembly quality and enables the estimation of both population abundance and community activity while allowing the recovery and analysis of potentially important components, such as RNA viruses. IMP is containerized using Docker which ensures reproducibility. IMP is available at http://r3lab.uni.lu/web/imp/.

of MG and MT data in a tailored way enabling efficient data usage in subsequent steps. 232 233

Assessment of the iterative assembly approach 234
De novo assemblies of MG or MT data usually result in a large fraction of reads that are 235 unmappable to the produced contigs and therefore remain unused, resulting in suboptimal data 236 usage. Previous studies (e.g. Muller, Pinel et al. 2014 [32]; Schürch et al. 2014 [45]; Reyes et 237 al. 2015 [46]) assembled the set of unmappable reads iteratively to successfully obtain 238 additional contigs from these additional rounds of assembly and which lead to an increase in 239 number of predictable genes. 240

241
In order to evaluate the best iterative assembly approach for IMP, we attempted to 242 determine the opportune number of assembly iterations in relation to assembly quality metrics. 243 The evaluation involved performing multiple iterations of recruiting unmappable reads to the 244 previously generated non-redundant assembly, followed by a de novo assembly of those 245 unmapped reads. The assembly from a given iteration was then merged with the previous 246 assembly to reduce the redundancy (refer to section Iterative single-omic assemblies for 247 details). The evaluation of additional assembly iterations for MG data of SM, HF and WW are 248 summarized in Fig. 2, based on four different metrics. Overall, each iteration on each of the 249 different datasets (SM, HF and WW) lead to an increase in total length of assembly and 250 increased the overall number of mappable reads, but differed in the observed gain of contigs 251 and genes ( Fig. 2; and Additional file 3: Table S2). A similar trend is noticeable for iterative 252 assemblies on MT data (see Additional file 2: Figure S2 and Additional file 3: Table S3). The 253 observed trends may be explained by the fact that the complexity of the data typically 254 confounds assemblies [44]. The exclusion of mappable reads in each iteration of assembly reduces the complexity of the data, which in turn allows additional contigs to be assembled and 256 results in a higher cumulative output [44]. 257 258 Considering the relatively low increase in longer contigs and genes beyond the first 259 assembly iteration (Fig. 2, Additional file 2: Figure S2 and Additional file 3: Table S2 and S3) 260 and the extended runtimes required to perform additional assembly iterations, an opportune 261 single iteration assembly approach was implemented in the workflow of IMP. This approach, 262 which balances the maximization of output yield with runtime, is implemented within the IMP-263 based co-assembly approach. 264

265
Benchmarking 266 The iterative co-assemblies were benchmarked against single-omic MG and MT assemblies 267 and co-assemblies obtained using the state-of-the-art MG data analysis pipeline, MetAMOS 268 [12]. 269

270
Single-omic assemblies and multi-omic iterative co-assemblies 271 Separate single-omic iterative assemblies on all datasets were generated using the preprocessed 272 MG and MT data (see Iterative single-omic assemblies). The iterative co-assemblies were 273 executed in IMP using the two available assembler options for the co-assembly step, i.e. the 274 default IDBA-UD [15] (hereafter referred to as IMP) and the optional MEGAHIT assembler 275 [16] (referred to as IMP-MEGAHIT). Both assemblers are regarded as state-of-the-art because 276 they perform assemblies on multiple kmer sizes, while MEGAHIT was also chosen due to its 277 superior speed and efficient memory usage [15,16]. 278 contigs, increased total length of assembled contigs and a higher number of predicted genes 282 (partial genes included) compared to single-omic assemblies for all datasets ( Table 1). The 283 apparent slight reduction in contiguity (N50 statistic) is due to the addition of shorter contigs 284 likely stemming from the increased sequencing depth of the combined MG and MT datasets, 285 which also increases the complexity of the assembly process. By using the reference genomes 286 from the SM data as ground truth, an improved recovery of reference genome fractions is 287 apparent for the IMP-based co-assemblies. Importantly, a significant increase in the number of 288 mappable MG and MT reads was observed within all co-assemblies compared to the respective 289 single-omic assemblies ( Table 1) which suggests superior data usage using the IMP-based 290 approach. For example, the IMP-based iterative co-assemblies resulted in a large fraction of 291 reads being mappable back to the contigs derived from the HF sample (average of 292 approximately 88.0 % and 96.3 % for the MG and MT reads, respectively; Table 1), which is 293 substantially higher compared to the numbers reported in a previous report in which MG 294 sequencing data was mapped to an integrated gene catalog, i.e. 74%-81% [11]. In summary, 295 the complementary use of MG and MT data in the context of de novo co-assembly results in 296 an increased yield of output, while enhancing overall data usage for subsequent analyses. 297 298 Quality assessment of the IMP-based iterative co-assembly procedure 299 Iterative co-assemblies using IMP (referred to as IMP and IMP-MEGAHIT, see section above) 300 and co-assemblies from MetAMOS [12] on all datasets were compared against each other. Additional file 2: Figure S3 and Additional file 3: Table S4 for detailed comparison and 309 results). 310 311 Based on the SM data ( Fig. 3A), IMP and MetAMOS-IDBA_UD performed similarly 312 for most measures but the IMP-based assemblies producing slightly better contiguity (N50 313 statistic) and lower levels of apparent misassembly. Despite MetAMOS producing the largest 314 number of contigs ≥ 1kb, its comparatively low N50 statistic indicates a highly fragmented 315 assembly, which is further reflected in the low number of predicted unique genes. Conversely, 316 this fragmented assembly is accompanied by relatively low misassembly rate, reinforcing the 317 notion that shorter contigs are less prone to misassemblies [44]. However, longer contigs (≥ 318 1kb) are a prerequisite for population-level genome reconstruction and subsequent multi-omic 319 data interpretation. On the other hand, IMP-MEGAHIT generated the highest number of 320 predicted unique genes, recovered the largest fraction of reference genomes, while yielding 321 comparatively large number of contigs ≥ 1kb, relatively high N50 statistic and a low rate of 322 misassembly. The assessment based on real datasets shows comparable performance between 323 IMP, IMP-MEGAHIT and MetAMOS-IDBA_UD ( Fig. 3B and C). In general, MetAMOS 324 produced highly fragmented assemblies for the real datasets, possibly due to the single kmer 325 length (k = 31) assemblies, which tend to produce relatively fragmented assemblies compared 326 to multiple kmer length assemblers (as in IMP, IMP-MEGAHIT and MetAMOS-IDBA_UD) 327 [23]. In summary (Fig. 3D), IMP and MetAMOS-IDBA_UD performed similarly for most 328 metrics when the same assembly program (IDBA-UD) was used by both pipelines. However, To further demonstrate the unique analytical capabilities of IMP, we set out to identify 380 populations with a high transcriptional activity in the HF sample. Average depth of coverage 381 at the contig-and gene-level is a common measure used to evaluate the abundance of microbial 382 populations within communities [25,27,32]. Integrative analysis of MG and MT data by IMP 383 further extends this measure by calculation of average MT to MG depth of coverage ratios, 384 which provides information on transcriptional activity and can be visualized using augmented 385 VizBin maps. 386 387 One particular cluster of contigs within the augmented VizBin maps displayed high MT 388 to MG depth of coverage ratios (see Additional file 2: Figure S4). The subset of contigs 389 (simplified as subset) within the selected cluster aligned to the genome of the Escherichia coli 390 P12B. For comparison, we also identified a subset which was highly abundant at the MG level 391 (lower MT to MG ratio), which aligned to the genome of Collinsella intestinalis DSM 13280 392 strain. Based on these observations, we highlighted these subsets of contigs to produce an 393 augmented VizBin map (Fig. 5A). In these, C. intestinalis and E. coli subsets are mainly 394 represented by clear peripheral clusters which exhibit consistent intra-cluster MT to MG depth 395 of coverage ratios (Fig. 5A). The subsets were manually inspected in terms of their distribution 396 of average MG and MT depths of coverage, comparing them against the corresponding 397 distributions of all the contigs. The MG-based average depths of coverage of the contigs from 398 the entire community exhibited a bell-shape like distribution, with a clear peak (mode). On the 399 contrary, MT depths of coverage exhibited a spread distribution, with a relatively lower mean 400 (compared to MG distribution) and no clear peak (Fig. 5B). The C. intestinalis subset displays 401 similar distributions to that of the entire community, whereas the E. coli subset exhibits an 402 unusually high MT-based depth of coverage, and a low MG-based depth of coverage (Fig. 5B). 403 Further inspection of the individual omic datasets revealed that the E. coli subset was not 404 the original study by Franzosa et al. (2014) [29], the cDNA conversion protocol used to 408 produce the MT data is known to introduce approximately 1-2 % of E. coli genomic DNA into 409 the cDNA as contamination which is then reflected in the MT data. According to our analyses, 410 0.12 % of MG reads and 1.95 % of MT reads from this sample could be mapped onto the E. MT data for studying microbial community structure and function in situ [4,6]. Accordingly, 420 we present a self-contained workflow which performs reproducible integrative analyses of 421 coupled MG and MT data derived from single and unique microbial community samples. IMP 422 encapsulates all processes including preprocessing, assembly, and analyses within an 423 automated reproducible pipeline. 424

425
We implemented customized preprocessing and filtering procedures for MG and MT 426 data due to the distinct nature of these different omic data types. We also evaluated the IMP-427 based iterative co-assembly procedure and found it to produce higher amount of output volume a higher fraction of read which can be mapped back to the contigs). IMP provides the option 430 for the use of two state-of-the-art assemblers, whereby the default assembler, IDBA-UD, 431 produces highly contiguous assemblies, while MEGAHIT balances the number of contigs with 432 favorable contiguity with a high number of predicted genes and a relatively low rate of 433 misassemblies. High quality assemblies yield better quality taxonomic information and gene 434 annotations. Consequently, the post-assembly analysis of MG and MT data enables users to 435 evaluate their community of interest based on taxonomy, functional potential and functional 436 expression. The integrated co-assembly also provides the opportunity for analyses not possible 437 based on MG data alone, such as the detection of RNA viruses and the identification of

Output 580
The output generated by IMP includes a multitude of large files. Paired-and single-end FASTQ 581 files of preprocessed MG and MT reads are provided such that the user may employ them for 582 additional downstream analyses. The output of the IMP-based iterative co-assembly consists 583 of a FASTA file, while the alignments/mapping of MG and MT preprocessed reads to the final 584 co-assembly are also provided as a binary alignment format (BAM), such that users may use 585 these for further processing.