Pipeline implementation
The PEPPRO pipeline is a python script (peppro.py) runnable from the command-line. PEPPRO provides restartability, file integrity protection, logging, monitoring, and other features. Individual pipeline settings can be configured using a pipeline configuration file (peppro.yaml), which enables a user to specify absolute or relative paths to installed software and parameterize alignment and filtering software tools. Required software includes several Python packages (cutadapt [28], looper, numpy [29], pandas [30], pararead, pypiper, and refgenie [31]) and R packages (installed via the included PEPPROr R package) in addition to some common bioinformatics tools including bedtools [32], bigWigCat [33], bowtie2 [34], fastq-pair [35], flash [36], picard, preseq [20], seqkit [18], samtools [37], seqtk, and wigToBigWig [33]. This configuration file will work out-of-the-box for research environments that include required software in the shell PATH, but may be configured to fit any computing environment and is adaptable to project-specific parameterization needs.
Refgenie reference assembly resources
Several PEPPRO steps require generic reference genome assembly files, such as sequence indexes and annotation files. For example, alignment with bowtie2 requires bowtie2 indexes, and feature annotation to calculate fraction of reads in features requires a feature annotation. To simplify and standardize these assembly resources, PEPPRO uses refgenie. Refgenie is a reference genome assembly asset manager that streamlines downloading, building, and using data files related to reference genomes [31]. Refgenie includes recipes for building genome indexes and genome assets as well as downloads of pre-indexed genomes and assets for common assemblies. Refgenie enables easy generation of new standard reference genomes as needed. For a complete analysis, PEPPRO requires a number of refgenie managed assets. Those assets as defined by refgenie are fasta, bowtie2_index, ensembl_gtf, ensembl_rb, refgene_anno, and feat_annotation. If building these assets manually, they separately require a genome fasta file, a gene set annotation file from RefGene, an Ensembl gene set annotation file in GTF format, and an Ensembl regulatory build annotation file. Finally, using PEPPRO with seqOutBias requires the additional refgenie tallymer_index asset of the same read length as the data.
Adapter-adapter ligation product abundance
Adapter-adapter ligation products show up in run-on libraries because there are two independent ligation steps. Sequencing these products is uninformative, and so there are several molecular approaches used to reduce their abundance in a sequencing library. All protocols include an inverted dT on the 3′ end of the 3′ adapter and also do not phosphorylate the 5′ end of the 5′ adapter. Many protocols include a size-selection gel extraction step to purify the library from a prominent adapter-adapter ligation species.
PEPPRO calculates adapter-adapter ligation products directly from cutadapt output, and the default -m value for this step is the length of the UMI plus two nucleotides. Therefore, if RNA insertions fewer than three nucleotides in length are present in the library, these are treated as adapter-adapter ligation products.
RNA insert size distribution and degradation
For both single- and paired-end data, the RNA insert size distribution is calculated prior to alignment. For single-end data, the calculation is derived only from sequences that contain adapter sequence, which is output directly from cutadapt [28]. PEPPRO plots the inverse cutadapt report fragment lengths against the cutadapt fragment counts. If there is a known UMI, based on user input, that length is subtracted from reported cutadapt fragment lengths. As a consequence of this distribution, we can establish a measure of library integrity by evaluating the sum of fragments between 10 and 20 bases versus the sum of fragments between 30 and 40 bases in length. The higher this degradation ratio, the more degraded the library.
Paired-end sequencing files often have shorter reads because a standard 75 base sequencing cartridge can be used for two paired-end reads that are each 38 nucleotides in length. Therefore, many fewer of the reads derived from either end of the molecule extend into the adapter sequence. To address this issue, we incorporate a step that fuses overlapping reads using flash [36]. Therefore, if two paired-end reads contain overlapping sequence, the reads are combined and the insert size is calculated directly from the fused reads and output directly from flash. This distribution is plotted identically to the single-end reads and degradation is calculated in the same manner. This degradation ratio metric is uniform between single-end or paired-end libraries and is reported prior to any alignment steps, minimizing influences from extensive file processing or alignment eccentricities.
Excluding size selection skews metrics
Recent PRO-seq protocols, including the H9 libraries we generated, exclude the PAGE size selection step that removes adapter-adapter ligation products [15]. Size selection can potentially bias against small RNA insertions. The previous two metrics, adapter-adapter abundance and degradation ratio, are naturally skewed toward the undesirable range if libraries are constructed without size selection. Adapter abundance is skewed because the sole purpose of size selection is to remove the adapter species, but these uninformative reads are of minimal concern and can be overcome by increasing sequencing depth. Degradation ratio is skewed higher because the size selection is not perfect and insert sizes in the range of 10–20 are preferentially selected against relative to those in the 30–40 range. Therefore, while we provide recommendations for optimal degradation ratios, this metric is not necessarily comparable between library preparation protocols and a higher ratio is expected for protocols that exclude size selection.
Removing UMI and reverse complementation
In a typical sequencing library, low library complexity is indicated by high levels of PCR duplicates. Conventional methods remove independent paired-end reads that map to the same genomic positions. This method works reasonably well for molecular genomics data sets with random nucleic acid cleavage. However, in PRO-seq, transcription start sites account for many of the 5′ RNA ends and polymerases pause downstream in a focused region [4]. Consequently, independent insertions with the same end points are common, especially in the promoter-proximal region. To solve this, PRO-seq protocols incorporate a unique molecular identifier (UMI) into the 3′ adapter to distinguish between PCR duplicates and independent insertions with shared ends. PEPPRO removes PCR duplicates only if UMIs are provided.
Following the removal of PCR duplicates, the UMI is trimmed. For run-on experiments where the sequencing primer sequences the 3′ end of the original RNA molecule, reverse complementation is performed. As only the first read contains a UMI in paired-end experiments, the second reads skip UMI trimming. Both steps are performed using either seqtk (https://github.com/lh3/seqtk) or fastx (https://github.com/agordon/fastx_toolkit), depending on user preference. Because reads are processed uniquely for first and second reads in a paired-end experiment, reads must be re-paired prior to alignment. PEPPRO uses the optimized implementation fastq-pair [35] to re-pair desynchronized read files.
Serial alignments
Following re-pairing, or starting from processed single-end reads, PEPPRO performs a series of preliminary, serial alignments (prealignments) before aligning to the primary reference using bowtie2 [34]. As a significant portion of nascent transcription includes rDNA, PEPPRO defaults to initially aligning all reads to the human rDNA sequence. Not only does this remove rDNA reads from downstream analysis, it improves computational efficiency by aligning the largest read pool to a small genome and reduces that read pool for subsequent steps. The user can specify any number of additional genomes to align to prior to primary alignment, which may be used for species contamination, dual-species experiments, repeat model alignments, decoy contamination, or spike-in controls. For serial alignments, bowtie2 is run with the following parameters -k 1 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50, where we are interested primarily in quickly identifying and removing any reads that have a valid alignment to the serial alignment genome (-k 1 parameter). These settings are easily adjusted in the pipeline configuration file (peppro.yaml).
Subsequent to these serial alignments, remaining reads are aligned to the primary genome. Primary genome alignment uses the bowtie2--very-sensitive option by default and sets the maximum paired-end fragment length to 2000. The goal with primary alignment is to identify the best valid alignment for reads, sacrificing speed for accuracy. Following primary alignment, low-quality reads are removed using samtools view -q 10. As with the initial prealignments, these parameters can be customized by the user in the pipeline configuration file (peppro.yaml). Alignment statistics (number of aligned reads and alignment rate) for all serial alignments and primary alignments are reported. For the primary alignment, PEPPRO also reports the number of mapped reads, the number removed for quality control, the total efficiency of alignment (aligned reads out of total raw reads), and the read depth. Prior to further downstream analysis, paired-end reads are split into separate read alignment files and only the first read is retained for downstream processing. For both paired-end and single-end experiments, this aligned read file is split by strand with both plus and minus strand aligned files further processed.
Processed signal tracks
Following read processing, alignment, strand separation, and quality control reporting, aligned reads are efficiently converted into strand-specific bigWig files by default. For PRO-seq and similar protocols, reads are reported from the 3′ end and may optionally be scaled by total reads. PEPPRO may alternatively use seqOutBias [19] to correct enzymatic sequence bias. Bias is corrected by taking the ratio of genome-wide observed read counts to the expected sequence based counts for each k-mer [19]. K-mer counts take into account mappability at a given read length using Genome Tools’ Tallymer program [38]. Correcting for enzymatic bias can be important as bias from T4 RNA Ligase used in PRO-seq protocols can yield erroneous conclusions [19]. As such, we recommend using seqOutBias for bias correction when analyzing a typical PRO-seq library. Bias correction is especially important when plotting composite profiles over sequence features. Strand-specific bigWigs may be visually analyzed using genomic visualization tools and provide a unified starting point for downstream analyses. For example, output bigWig files can be directly loaded into dREG to identify regulatory elements defined by bidirectional transcription [1].
Exon-intron ratio plots
PEPPRO provides an mRNA contamination histogram for quick visual quality control, and a BED format file containing gene by gene exon:intron ratios for detailed analysis. To calculate this metric, PEPPRO utilizes annotation files derived from UCSC RefSeq gene files. Because promoter-proximal pausing inflates these ratios, PEPPRO excludes the first exon from the calculation. Otherwise, the reads per kilobase per million mapped reads (RPKM) is calculated for all exonic and intronic sequences on a gene by gene basis. Then, the ratio of exon RPKM to intron RPKM is determined for every gene. The overall measure, the mRNA contamination metric, is the median of all genic exon to intron density ratios.
Pause index
Pause indices are calculated as the ratio of read density in the promoter proximal region versus read density in the gene body. To calculate these values, PEPPRO utilizes annotation files derived from Ensembl gene set files. Pause indices can vary widely depending on the defined pause window and how a pause window is determined (i.e., relative to a TSS or the most dense window proximal to a TSS). PEPPRO defines the density within the pause region as the single, most dense window +20–120 bp taken from all annotated TSS isoforms per gene. This is necessary as some genes contain multiple exon 1 annotations and because this region is where most polymerase pausing occurs, PEPPRO identifies the predominant exon 1, based on density, and calculates the pause index using this window density. This means that for genes with multiple TSSs, we define the pause window as the region +20-120 bases from each identified TSS per gene. We determine the read density at every annotated pause window per gene and identify the predominant, singular pause window as the pause window with the greatest density. This singular pause window is used to calculate the pause index for that gene. The corresponding gene body is defined as the region beginning 500-bp downstream from the predominant TSS to the gene end. We found that lowly expressed genes represent a significant portion of genes with a low pause index. At low sequencing depth, these lowly expressed genes experience greater dropout and fluctuation in pause index calculation, skewing the metric upwards at low depth. To address this, we restrict the pause index calculation to the upper 50th percentile of genes by expression, which eliminates the variability due to depth. Finally, PEPPRO plots the distribution of pause indices for each remaining gene in a histogram and provides a BED-formatted file containing each gene’s pause index for more detailed analyses.
PRO-seq experiments
H9 PRO-seq experiments were conducted as described previously [15]. The HDACi-treated samples were incubated with 200nM romidepsin for 60 min prior to harvesting. The control “untreated” samples were treated with DMSO for 60 min. We have included these samples as a test to demonstrate differential expression analysis using PEPPRO. They also provide additional example libraries for the metrics in general and, unexpectedly, show significant differences in pause index upon treatment.
Synthetic experiments
Synthetic sequencing depth variant libraries were constructed for single-end and paired-end PRO-seq libraries using either the K562 PRO-seq (GSM1480327) or H9 PRO-seq 2 (GSM4214081) as source libraries, respectively. For K562 PRO-seq subsamples, seqtk sample -s99 was called on the raw fastq files to generate libraries between 2 and 10%, in 2% increments, and between 10 and 100%, in 10% increments. For the H9 PRO-seq libraries, seqtk sample -s99 was called on the raw fastq files to produce libraries between 10 and 100%, in 10% increments. Lower percentage K562 PRO-seq libraries were generated to yield libraries of total size comparable to low percentage H9 PRO-seq libraries.
RNA-seq spike-in libraries were also produced using the command seqtk sample -s99 on raw fastq files using combinations of the K562 PRO-seq library utilized prior and a corresponding K562 RNA-seq library (GSM765405). RNA-seq libraries were sampled between 10 and 100%, in 10 percent increments, and concatenated with the sampled K562 PRO-seq libraries to generate mixed libraries composed of 0–100% RNA-seq.
Low complexity libraries were similarly constructed. Thirty million total read libraries were generated by using seqtk sample -s99 on the H9 PRO-seq 2 library and sampling at 50, 80, 90, 92, 94, 96, 98, and 100%. At each percentage of original H9 PRO-seq 2 library sample, the remainder represents duplicates of the original raw reads composing the opposite percentage, producing libraries with varying levels of duplicated reads.