STARRPeaker: uniform processing and accurate identification of STARR-seq active regions

STARR-seq technology has employed progressively more complex genomic libraries and increased sequencing depths. An issue with the increased complexity and depth is that the coverage in STARR-seq experiments is non-uniform, overdispersed, and often confounded by sequencing biases, such as GC content. Furthermore, STARR-seq readout is confounded by RNA secondary structure and thermodynamic stability. To address these potential confounders, we developed a negative binomial regression framework for uniformly processing STARR-seq data, called STARRPeaker. Moreover, to aid our effort, we generated whole-genome STARR-seq data from the HepG2 and K562 human cell lines and applied STARRPeaker to comprehensively and unbiasedly call enhancers in them. Supplementary information The online version contains supplementary material available at 10.1186/s13059-020-02194-x.


Background
The transcription of eukaryotic genes is precisely coordinated by an interplay between cis-regulatory elements. For example, enhancers and promoters serve as binding platforms for transcription factors (TFs) and allow them to interact with each other via three-dimensional looping of chromatin. Their interactions are often required to initiate transcription [1,2]. Enhancers, which are often distant from the transcribed gene body itself, play critical roles in the upregulation of gene transcription. Enhancers are cell-type-specific and can be epigenetically activated or silenced to modulate transcriptional dynamics over the course of development. Enhancers can be found upstream or downstream of genes, or even within introns [3][4][5]. They function independent of their orientation, do not necessarily regulate the closest genes, and sometimes regulate multiple genes at once [6,7]. In addition, several recent studies have demonstrated that some promoters-termed E-promoters-may act as enhancers of distal genes [8,9].
Consensus sequences (or canonical sequences) have been identified at certain protein binding sites, splice sites, and boundaries of protein-coding genes. However, there are no known consensus sequences that characterize enhancer function, making it peak-calling software developed for ChIP-seq such as MACS2 [16,18,19]. However, one must be cautious using ChIP-seq peak callers, at least without re-tuning the default parameters optimized for processing TF ChIP-seq [20].
In this paper, we describe key differences in the processing of STARR-seq versus ChIP-seq data. Due to increased complexity of the genomic screening library and sequencing depth requirements, STARR-seq coverage is highly non-uniform. This leads to a lower signal-to-noise ratio than a typical ChIP-seq experiment and makes estimating the background model more challenging, which could ultimately lead to falsepositive peaks. In addition, STARR-seq measures more of a continuous activity, similar to quantification in RNA-seq, than a discrete binding event. Therefore, STARR-seq peaks should be further evaluated using a notion of activity score. These differences necessitate a unique approach to processing STARR-seq data.
We propose an algorithm optimized for processing and identifying functionally active enhancers from STARR-seq data, which we call STARRPeaker. This approach statistically models the basal level of transcription, accounting for potential confounding factors, and accurately identifies reproducible enhancers. We applied our method to two whole human STARR-seq datasets and evaluated its performance against previous methods. We also compared an R package, BasicSTARRseq, developed to process peaks from the first STARR-seq data [15], which models enrichment of sequencing reads using a binomial distribution. We benchmarked our peak calls against known human enhancers. Thus, our findings support that STARRPeaker will be a useful tool for uniformly processing STARR-seq data.

Results and discussion
Precise measurement of STARR-seq coverage We binned the genome using a sliding window of length, l, and step size, s. Based on the average size of the STARR-seq library, we defined a 500-bp window length with a 100-bp step size to be the default parameter. Based on the generated genomic bins, we calculated the coverage of both STARR-seq input and output mapped to each bin. For calculating the sequence coverage, other peak callers and many visualization tools commonly use the start position of the read [15,21,22]. However, given that the average size of the fragments inserted into the STARR-seq libraries were approximately 500 bp, we expected that the read coverage using the read start position may shift the estimate of the summit of signal and dilute the enrichment. Some peak callers have used read densities of forward and reverse strands separately to overcome this issue [23,24]. To precisely measure the coverage of STARR-seq input and output, we first inferred the size of the fragment insert from paired-end reads and used the center of the fragment insert, instead of start position of the read, to calculate coverage. For inferring the size of the fragment insert, we first strictly filtered out reads that were not properly paired and chimeric. Chimeric alignments are reads that cannot be linearly aligned to a reference genome, implying a potential discrepancy between the sequenced genome and the reference genome and indicative of a structural variation or a PCR artifact [25]. We also filtered out read pairs that had a fragment insert size greater than l max and less than l min . By default, we filtered out fragment insert sizes less than 200 bp and greater than insert and counted the fragment depth for each genomic bin. To assess the benefit of using fragment-based coverage, we compared the coverage calculated using the center of fragment insert to an alternate model using the start position of the sequencing read. We found that the position of the peaks shifted up approximately 200 bp when we used the alternate model (Fig. 1a, Additional file 1: Fig. S1A). Such a shift caused by the read-based coverage could lead to the omission of TF binding sites located at the boundary. Moreover, we observed that the read-based coverage diluted the overall STARR-seq signal; as a result, peaks calculated based on the alternate model had lower fold enrichment (ratio of STARR-seq output to input) and were less confident (based on statistical significance of the identified peaks) and broader in size ( Fig. 1b-d, Additional file 1: Fig. S1B-D). Overall, the fragment-based coverage offered a more accurate representation of the STARR-seq readout compared to the read-based coverage counting scheme. We illustrated the benefits of using the center of the fragment in Fig. 1e, which compares the average size of peaks and fold enrichments using alternate coverage counting methods.

Controlling for potential systemic bias in the STARR-seq assay
To unbiasedly test for the regulatory activity, a model needs to control for potential systemic biases inherent to generating STARR-seq data. STARR-seq measures the ratio of transcribed RNA to DNA for a given test region and determines whether the test region can facilitate transcription at a higher rate than the basal level. This is based on the assumption that (1) the basal transcriptional level stays relatively constant across the genome and (2) the transcriptional rate is a reflection of the regulatory activity of the DNA insert. However, these assumptions may not always be true, and one needs to consider potential systemic biases that can interfere with the quantification of regulatory activity when analyzing the data.
We next tested whether potential sequencing biases and other covariates confounded STARR-seq readouts ( Fig. 2 and Additional file 1: Fig. S2). We found that STARR-seq RNA coverage was significantly correlated with GC content (PCC 0.61; P value 1E −299) and mappability (PCC 0.45; P value 2.9E−148). This could be attributed to intrinsic sequencing biases in library preparation. A genome-wide reporter library is made from randomly sheared genomic DNA, but DNA fragmentation is often non-random [26]. Studies also have suggested that epigenetic mechanisms and CpG methylation may influence fragmentation [27]. Furthermore, the isolated polyadenylated RNAs are reverse transcribed and PCR-amplified before sequencing, and this process can further confound the sequenced candidate fragments.
Notably, we found that STARR-seq coverage was also significantly confounded by RNA thermodynamic stability (PCC − 0.55; P value 0). Unlike ChIP-seq, where both the experiment and input controls derive from the same DNA origin, STARR-seq experiments measure the regulatory potential from the abundance of transcribed RNA, which adds a layer of complexity. For example, RNA structure and co-transcriptional folding might potentially influence the readout of STARR-seq experiments [28]. Singlestranded RNA starts to fold upon transcription and the resulting RNA structure might influence the measurement of regulatory activity. Previously, researchers suggested a potential linkage between RNA secondary structure and transcriptional regulation [29].
In addition, the resulting transcribed RNA undergoes a series of post-transcriptional regulation, and RNA stability might play a critical role. Moreover, previous reports have shown that the degradation rates-the main determinant of cellular RNA levels [30]vary significantly across the genome and that RNA stability correlates with functionality [31,32].
Based on these findings, we built a regression-based model that accounts for various confounding variables of test sequence fragments to unbiasedly identify potential Fig. 1 Comparison of STARR-seq output coverage calculated using the center of the fragment to using the start position of the sequencing read. a Distribution of the shift in final peak locations resulting from using two alternative coverage counting schemes in HepG2. Comparison of b overall fold enrichment level, c P value, and d size of resulting peaks. e Example highlighting the difference between fragment-based and read-based coverage counting schemes and their resulting peak calls from HepG2 STARR-seq data. Asterisks represents statistical significance using the Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction; *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001 enhancer regions from STARR-seq data. Note that many of the covariates have appreciable correlation with each other. However, we did find, using stepwise forward selection, that each of them contributes substantially and independently to the model fit as assessed by Akaike information criterion (AIC) and Bayesian information criterion (BIC) (Additional file 1: Fig. S3).

Accurate modeling of STARR-seq using negative binomial regression
To model the fragment coverage data from STARR-seq using discrete probability distribution, we assumed that each genomic bin is independent and identically distributed, as specified in the Bernoulli trials [33]. That is, each test fragment can only map to a single fixed-length bin. Therefore, we only considered a non-overlapping subset of bins for modeling and fitting the distribution. We also excluded bins not covered by any genomic input or those in which the normalized input coverage was less than a minimum quantile t min , since these regions do not have sufficient power to detect enrichment. We selected the bin size and the minimum coverage based on the experimental design of STARR-seq. We simulated and fit various discrete probability distributions to STARR-seq output coverage. We observed that the STARR-seq output coverage data was overdispersed and fit the best with a negative binomial distribution (Fig. 3a). Moreover, a Q-Q plot of expected coverage against observed coverage further demonstrated that the negative binomial model provides the best fit for the STARR-seq readout (Fig. 3b). To demonstrate the generalizability of the method and assumptions, we tested the model fit against previously published datasets that utilized different STARR-seq protocols (Additional file 1: Fig. S4). We observed consistent properties of the STARRseq data across the different STARR-seq protocols and datasets.
In principle, we can also detect negative enrichments in the STARR-seq output coverage, suggesting that some candidate fragments can repress the basal transcriptional activity. However, these regions may contain sequences that can destabilize mRNAs. In addition, additional experiments are necessary to demonstrate that STARRseq can reliably detect silencers. Therefore, we focus on positive enrichments in this study, and we defer to a systemic method specifically designed for identifying silencers for this task [34].

Peak-calling algorithm
To accurately model the ratio of STARR-seq output fragment coverage (RNA) to input fragment coverage (DNA) while controlling for potential confounding factors, we applied a negative binomial regression. The overview of our model is outlined in Fig. 4. Our model starts by fitting an analytical distribution to the observed fragment coverage across fixed non-overlapping genomic bins. In doing so, we use covariates to model expected counts in the form of multiple regression. Subsequently, once a model is fit, we evaluate the likelihood of obtaining the observed fragment counts and assign P values using the null negative binomial distribution. In this testing phase, we use flexible genomic bins with a sliding window in order to find enrichment peaks at a higher resolution. Genomic bins with significant enrichments are selected based on their adjusted P values using multiple testing correction. Finally, peak locations are fine-tuned to the Fig. 3 STARR-seq output coverage is fit against simulated coverage using three distribution models; negative binomial, binomial, and Poisson. a Density histogram of simulated distribution against STARR-seq output coverage. b Q-Q plot of simulated distribution against STARR-seq output coverage. The red solid line represents where the observed count equals the expected count summit of the direct fragment coverage. Note that the adjusted P value only refers to the unlikelihood of a candidate region being an enhancer by chance while the fold enrichment can be directly interpreted as a quantitative measure of enhancer activity.
Let Y be a vector of STARR-seq output (RNA) coverage, then y i for 1 ≤ i ≤ n denotes the number of RNA fragments from a STARR-seq experiment mapped to the ith bin from the total of n genomic bins. Let t i be the number of input library fragments (DNA) mapped to the ith bin. We define X to be the matrix of covariates, where x i ! is the vector of covariates corresponding to the ith bin and x ij is the jth covariate for the ith bin.
A negative binomial is a generalization of a Poisson regression that allows the variance to be different from the mean, shaped by the dispersion parameter θ. There are two alternative forms of parameterization for a negative binomial-NB1 and NB2- Fig. 4 Overview of STARRPeaker peak-calling scheme. a In contrast to using read depth (gray), fragment depth (red) offers more precise and sharper STARR-seq output coverage. Fragment inserts are directly inferred from properly paired reads. b Workflow of STARRPeaker describing how coverage is calculated for each genomic bin and modeled using a negative binomial regression model. The analysis pipeline can largely be divided into four steps: (1) binning the genome; (2) calculating coverage and computing covariate matrix; (3) fitting the STARR-seq data to the NB regression model; and (4) peak calling, multiple hypothesis testing correction, and adjustment of the center of peaks which were first introduced by Cameron and Trivedi [36]. The difference between NB1 and NB2 is in the conditional variance of y i . Assuming y i has mean λ i , the general variance function follows the form ω i = λ i + αλ i p , where α is a scalar parameter. NB1 uses p = 1, whereas NB2 uses the quadratic form of variance with p = 2. We use the most common implementation of the negative binomial, NB2, hereafter. The variance for the NB2 model is given as We assume that the majority of genomic bins will have a basal level of transcription, the expected fragment counts at each ith bin, E(y i ), represents the mean incidence, μ i , and the count of RNA fragments Y follows the traditional negative binomial (NB2) distribution.
Negative binomial regression model The regression term for the expected RNA fragment count can be expressed in terms of a linear combination of explanatory variables, a set of m covariates ( x ! ). We use the input library variable t i as one covariate. For simplicity, we denote t i as x 0i hereafter.
Alternatively, instead of using the input library variable t i as one covariate, we can directly use it as an offset variable. Generally, a fractional observation cannot be modeled using discrete probability. However, an offset variable in a generalized linear model can be used to correct the response term to behave like a fraction. One advantage of using the input variable as an "exposure" to the RNA output coverage is that it allows us to directly model the basal transcription rate (the ratio of RNA to DNA) as a rate response variable. This mode could be beneficial in multiple scenarios. First, for a STAR R-seq dataset in which the guide DNA library only contains discrete elements, direct modeling of the basal transcription rate would provide a more accurate measure of activity, especially in the absence of readouts from adjacent regions. Second, if a user ignores the effect of covariate, this mode simplifies the model and provides peaks purely based on the ratio of RNA to DNA. More details on this alternative parameterization are included in the "Methods" section. In our STARRPeaker model, we used four covariates; fragment coverage of input genomic libraries, GC content, mappability, and the thermodynamic stability of genomic libraries.

Maximum likelihood estimation
We fit the model and estimate regression coefficients using the maximum likelihood method, where log-likelihood function is shown as follows.
Substituting μ i with the regression term, the log-likelihood function can be parameterized in terms of regression coefficients, β.
We can determine the maximum likelihood estimates of the model parameters by setting the first derivative of the log-likelihood with respect to β, the gradient, to zero, and there is no analytical solution for β. Numerically, we iteratively solve for the regression coefficients β and the dispersion parameter θ, alternatively, until both parameters converge.

Estimation of P value
The P value is defined as the probability of observing equal or more extreme value than the observed value at the ith bin, y i , under the null hypothesis.
As defined earlier, we assume the random variable Y comes from a negative binomial distribution with the fit mean at the ith bin, μ i , as the expected value, and θ as the dispersion parameter. Then, we can estimate the P value from the cumulative distribution function CDF, which is the sum of the probability mass function f Y from 0 to y i − 1.
Substituting (1) gives Finally, we calculate the false discovery rate using the Benjamini and Hochberg method [38].

Application of STARRPeaker
We applied STARRPeaker to two whole human genome STARR-seq experiments, K562 and HepG2, utilizing origin of replication (ORI)-based plasmids [39]. Based on peaks identified from these datasets, we evaluated the quality and characteristics of the identified enhancers as well as the performance of the peak caller by comparing to external enhancer resources.

Initial evaluation of STARRPeaker enhancers
Our model assumes that most genomic regions have a basal level of transcription activity. To validate that our model is well calibrated under this assumption, we examined the P value distribution against a theoretical uniform distribution. As expected, most observed P values followed the null distribution, and the P values only deviated from expectation for very low values (Additional file 1: Fig. S5).
We processed two biological replicates from each cell type independently and assessed the correlation between each pair (see "Methods" for detail). We identified 50,389 and 52, 927 candidate enhancers from HepG2 replicates 1 and 2, and by consolidating peaks from both replicates, we identified 32 Table S1). Overall, we observed high correlation of RNA fragment coverage between two replicates (PCC = 0.99 for both HepG2 and K562; see Additional file 1: Fig. S6). This result indicates the correlation is dominated by negatives. Although the total number of peaks varied between HepG2 and K562, we observed a comparable number of peaks within the accessible region of the genome. We found 12,019 (36.5%) and 11,420 (55.8%) candidate enhancers from HepG2 and K562, respectively, within the open chromatin defined by ENCODE DNase-seq hotspots. Consistent with previous findings [39], a substantial fraction of candidate enhancers was epigenetically silenced at the chromatin level. However, as demonstrated previously using a histone deacetylase inhibitor (HDAC) [16], these poised enhancers can become functional under a more transcriptionally permissive environment. Therefore, episomal reporter assays like STARR-seq have the unique advantage of detecting potential enhancer activity independent from chromatin context. We would like to note that it is important to identify poised enhancers located in heterochromatic regions of the genome, which could become functional during developmental or pathological time courses.

Assessment of robustness and reproducibility of the method
A reliable peak-calling method allows one to identify stable peaks from suboptimal datasets. To evaluate the robustness of the STARRPeaker method, we generated random subsets of HepG2 whole-genome STARR-seq library after aligning the reads and compared the quality of the peak calls. We subsampled randomly at various rates from 20 to 80% of the total dataset. We found that STARRPeaker was able to reliably identify approximately 90% of the candidate enhancers (consolidated) using 80% of the original sequencing library and 80% of the candidate enhancers using 40% of the original sequencing library (Additional file 1: Fig. S7A). When we focused on strong enhancer candidates, approximately 98% of the top 5000 enhancers were recovered using only 60% of the original sequencing library (Additional file 1: Fig. S7B). However, the quality of the peak calls started to deteriorate when 40% or less were used.

Evaluation of potential orientation bias in candidate enhancers
In general, enhancers are thought to function independent of orientation [40]. However, the fragment counts in one orientation could be skewed over the other due to orientation-specific activities, PCR, or sequencing artifacts. To test for potential orientation-based biases, we ran a series of binomial tests on the candidate enhancers we identified and evaluated for possible orientation-specific activities (see "Methods").
We observed a small fraction of candidate enhancers showing orientation bias [2.58% for HepG2 (n = 850); 4.49% for K562 (n = 919); FDR ≤ 0.01] in both replicates (Additional file 1: Fig. S8). Furthermore, only a few candidates showed extreme bias [n = 3 for HepG2; n = 4 for K562; > 90% fragments on one strand]. Thus, true orientationdependent activities are unlikely in our STARR-seq data, but that the orientation may have an effect on the efficiency. These findings provide further support that enhancers function independent of orientation.

Performance comparison to other peak-calling algorithms
We evaluated the performance of STARRPeaker by comparing it to previously used alternative methods, namely BasicSTARRseq and MACS2.
First, we qualitatively assessed the peak-calling algorithms using a simulated dataset where the ground truth exists. The simulation was primarily designed for positives in the dataset, and as a result, it emphasizes the sensitivity (as opposed to specificity) of each method. We created an artificial STARR-seq dataset that contains 28 spike-in controls; a hybrid of DNA input and RNA output libraries with active elements at predefined loci (Additional file 7: Table S6; see "Methods" for details). All three methods successfully identified peaks at 28 control regions (Additional file 1: Fig. S9). However, we noticed that BasicSTARRseq peaks were fragmented and off-centered from the controls due to its limitations of fixed peak size and read-based coverage calculation. As a result, several false-positive peaks were called adjacent to the controls. We calculated the sensitivity and specificity of each method using 28 controls as the gold standard (Additional file 8: Table S7). We considered a result positive if at least 80% of the peak overlapped with the control, and we assumed the presence of approximately 1000 true negative regions. Both STARRPeaker and MACS2 had 100% sensitivity and specificity, whereas BasicSTARRseq had 75% sensitivity and 94.9% specificity.
Second, we quantitatively assessed the peak-calling algorithms using the whole human genome STARR-seq dataset. After uniformly calling peaks from each method using the recommended default settings, we evaluated the quality of the candidate enhancers identified. We found that both BasicSTARRseq and MACS2 called significantly more peaks (4-to 20-fold higher) than STARRPeaker (Additional file 5: Table S4). While it is uncertain how many true enhancers were present in each sample, we had to ensure that we make a fair comparison across different methods due to the tradeoff between sensitivity and specificity. An increase in sensitivity is generally achieved at the expense of a decrease in specificity, as described in receiver operating characteristic curves. In our context, a method having higher specificity suffers from having less overlap with open chromatin and previously identified enhancers from other assays. Suppose each method is generating a set of randomized peaks, then the method with the greater number of peaks is likely to have more overlap solely by chance. To eliminate this artifact, we used a uniform P value threshold of 0.001 and subsampled the peaks down to the same number before the comparison. After uniformly processing the dataset using each method, we measured the level of epigenetic profile enrichment around the peaks. We observed higher enrichment of DNase-hypersensitive sites, as well as more distinct double-peak patterns of H3K27ac and H3K4me1, using STARRPeaker compared to BasicSTARRseq or MACS2 (Fig. 5, Additional file 1: Fig. S10). Furthermore, STARRPeaker peaks had significantly higher enrichment of TF binding events (based on the number of TF ChIP-seq binding sites) compared to the peaks identified using other methods.

Comparison to previously characterized enhancers
First, we compared the peaks identified by STARRPeaker to previously characterized enhancers from HepG2 or K562 cell lines by CAGE [41], MPRA [17,42], and STARRseq [19] (Fig. 6, Additional file 3: Table S2). Since most of the previous enhancer assays were not at a genome-wide scale and were limited in scope to specific loci, we focused on how many previously characterized enhancers were recovered using different methods. Overall, we observed a higher fraction of STARRPeaker peaks overlapping with external datasets compare to other methods. Moreover, we found higher overlaps when peaks from both replicates were merged, due to fewer but more precise candidate enhancers from merging replicates. However, we noticed reduced agreement across different types of enhancer assays. Low overlap between assays may arise from different formats or layouts of reporter plasmids, such as differing enhancer cloning sites or promoters, or differences in the complexity of the screening library. Furthermore, CAGE is an entirely different assay from episomal reporter assays like MPRA and STARR-seq, with enhancers defined based on bidirectional transcripts originating from an eRNA. Second, we examined the nine distal enhancers from the GATA1 and MYC loci characterized in depth by a CRISPRi tiling screen [43]. Using the nine enhancers as the ground truth and assuming there are approximately 1300 potential negative elements in the span of a 1.3-Mb genomic sequence, we performed sensitivity and specificity analyses of three competing methods. We think this is a substantial dataset to benchmark the performance as it provides an orthogonal line of evidence for true in vivo enhancers and highlights that the most genomic regions are dominated by negatives. As speculated, we found that STARRPeaker had the highest specificity. While both Basic-STARRseq and MACS2 identified a few more enhancers, their false-positive rate was much higher than that of STARRPeaker (Additional file 1: Fig. S11, Additional file 9: Table S8). Furthermore, upon close examination of three enhancers that were missed by STARRPeaker, we observed the coverage was much lower than other regions identified as enhancers.

Application to external STARR-seq datasets
To ensure that STARRPeaker can be generally applied to different variants of STARRseq assays, we tested STARRPeaker on previously published STARR-seq datasets.
First, we applied STARRPeaker to the whole-genome ORI-STARR-seq dataset on HeLa-S3 [39] and assessed the quality of the peaks identified. Consistent with the previous claim that IFN-I signaling may induce false-positive enhancers, we identified more peaks in untreated HeLa-S3 samples (n = 28,381) compared to inhibitor-treated samples (n = 16,150). Furthermore, peaks from untreated samples had lower enrichment of chromatin accessibility (DNase-seq) than those from inhibitor-treated samples, supporting that TBK1/IKK/PKR inhibition reduces false-positive enhancer signals related Fig. 6 Comparison of peaks using an external dataset for a HepG2 or b K562 cell lines. Peaks identified from STARRPeaker as well as BasicSTARRseq and MACS2 were compared against a published dataset. For a fair comparison, all peaks were centered at the summit, uniformly thresholded using P value < 0.001, and 20,000 peaks were randomly drawn from peaks identified by each peak caller using the recommended settings. The fraction of overlap was computed for each replicate. We used the total number of peaks in each dataset as the denominator. We considered it an overlap when at least 50% of peaks intersected each other to IFN-I signaling (Additional file 1: Fig. S12A). Moreover, STARRPeaker recovered 77.5% (n = 7451) of published peaks, which were called using BasicSTARRseq and then further shortlisted using a stringent threshold (P value 1E−5 with corrected enrichment ≥ 4). When we compared the quality of STARRPeaker peaks (n = 16,150) to the published post hoc filtered peaks (n = 9610), we found that STARRPeaker peaks were highly enriched with chromatin accessibility signals despite having 6540 additional peaks from the same HeLa-S3 dataset (Additional file 1: Fig. S12B). When we applied the same post hoc filtering approach to STARRPeaker peaks, the chromatin accessibility enrichment was further improved (Additional file 1: Fig. S12C).
Second, we tested if STARRPeaker can be reliably applied to captured STARR-seq datasets (Cap-STARR-seq). We applied STARRPeaker to a previously characterized GM12878 STARR-seq dataset based on an ATAC-seq-capture technique called HiDRA [44] and compared its performance with published results. The HiDRA dataset was reported to have~65,000 regions with enhancer function. In the STARRPeaker run, we identified 52,857 regions with significant enhancer activities from the five replicates they produced. Approximately half of STARRPeaker peaks overlapped with the published results (n = 26,318). While it is debatable to claim that one method is superior to the other, this result clearly demonstrates that STARRPeaker can be applied to the Cap-STARR-seq dataset.
Third, we further evaluated the performance of the peak-calling methods by applying STARRPeaker and two other peak-calling methods to another published Cap-STARRseq dataset on K562 [19]. The dataset covers approximately 91% of the surrounding 3 Mb of the MYC locus. Consistent with the earlier analysis, we observed that STAR RPeaker is highly specific and identifies fewer candidate enhancers (n = 26) compared to the other methods (BasicSTARRseq n = 223; MACS2 n = 136). Furthermore, a fourway comparison (STARRPeaker, BasicSTARRseq, MACS2, and published peaks) showed that all of the STARRPeaker peaks overlapped with peaks from other methods but not the other way around (Additional file 1: Fig. S13). These results indicate that STARRPeaker is more robust and reliable at identifying reproducible candidate enhancers from various STARR-seq datasets than previous methods.

Conclusions
In summary, we developed a reliable peak-calling analysis pipeline named STARRPeaker that is optimized for large-scale STARR-seq experiments. To illustrate the utility of our method, we applied it to two whole human genome STARR-seq datasets from K562 and HepG2 cell lines, utilizing ORI-based plasmids.
STARRPeaker has several key improvements over previous approaches including (1) precise and efficient calculation of fragment coverage; (2) accurate modeling of the basal transcription rate using negative binomial regression; and (3) accounting for potential confounding factors, such as GC content, mappability, and the thermodynamic stability of genomic libraries. We demonstrate the superiority of our method over previously used peak callers, supported by strong enrichment of epigenetic marks relevant to enhancers and overlap with previously known enhancers.
To fully understand how noncoding regulatory elements can modulate transcriptional programs in human, STARR-seq active regions must be further characterized and validated within different cellular contexts. For example, recent applications of CRISPR-dCas9 to genome editing have allowed researchers to epigenetically perturb and test these elements in their native genomic context [45,46]. The next step for CRISPRbased functional screens is to overcome the current limitation of a small scale by leveraging barcodes and single-cell sequencing technology [47]. In the meantime, we envision that the STARRPeaker framework could be utilized to detect and quantify enhancers at the whole-genome level, thereby aiding in prioritizing candidate regions in an unbiased fashion to maximize functional characterization efforts.

Generating an ORI-STARR-seq input plasmid library
We sonicated human male genomic DNA (Promega #G1471) using a Covaris S220 sonicator (duty factor, 5%; cycle per burst, 200; 40 s) and ran it on a 0.8% agarose gel to size-select 500-bp fragments. After gel purification using a MinElute Gel Extraction kit (Qiagen), we end-repaired, ligated custom adaptors, and PCR-amplified DNA fragments using Q5 Hot Start High-Fidelity DNA polymerase (NEB) (98°C for 30 s; 10 cycles of 98°C for 10 s, 65°C for 30 s, and 72°C for 30 s; 72°C for 2 min) to add homology arms for Gibson assembly cloning.
We used AgeI-HF (NEB) and SalI-HF (NEB) to linearize the hSTARR-seq_ORI plasmid (gift from Alexander Stark; Addgene plasmid #99296) and cloned the PCR products into the vector using Gibson Assembly Master Mix (NEB); we set up 60 replicate reactions to maintain complexity. We purified the assembly reactions using SPRI beads (Beckman Coulter), dialyzed them using Slide-A-Lyzer MINI dialysis devices (Thermo Scientific), and concentrated them using an Amicon Ultra-0.5 device (Amicon). We transformed the reaction into MegaX DH10BTM T1 electrocompetent cells (Thermo Fisher Scientific) (with 25 replicate transformations to maintain complexity) and let them grow in 12.5-L LB-Amp medium until they reached an optical density of~1.0. We extracted the plasmids using a Plasmid Gigaprep Kit (Qiagen) and dialyzed the plasmid prep using Slide-A-Lyzer MINI dialysis devices before electroporation.

Electroporation-mediated transfection of ORI-STARR-seq input plasmid library into K562 and HepG2 cell lines
We electroporated the ORI-STARR-seq library using an AgilePulse Max (Harvard Apparatus) and generated two biological replicates for each cell line. For K562 cells, we electroporated 5.6 mg of input plasmid library into 700 million cells per biological replicate by delivering three 500-V pulses (1 ms duration with a 20-ms interval). For HepG2 cells, we electroporated 8 mg of input plasmid library into one billion cells in one replicate, and 5.6 mg into 700 million cells in another replicate by delivering three 300-V pulses (5 ms duration with a 20-ms interval).

Output RNA library
We harvested cells 24 h after electroporation and extracted total RNA using an RNeasy Maxi kit (Qiagen). We further isolated polyA-plus mRNA using Dyna-beads® Oligo (dT) kit (Thermo Fisher Scientific), treated it with TURBO DNase (Invitrogen), and purified the reaction using an RNeasy MinElute Kit (Qiagen). We synthesized cDNA using SuperScript III (Thermo Fisher Scientific) with a custom primer that specifically recognizes mRNAs that had been transcribed from the ORI-STARR-seq library. After reverse transcription, we treated the reactions with a cocktail of RNase A and RNase T1 (Thermo Fisher Scientific). We split cDNA samples into 160 replicate sub-reactions, and PCR-amplified each sub-reaction with a primer with a unique index (helping to identify PCR duplicates) using Q5 Hot Start High-Fidelity DNA polymerase (NEB) with the following program: 98°C for 30 s; cycles of 98°C for 10 s, 65°C for 30 s, 72°C for 30 s (until they reached midlog amplification phase; we cycled 18 cycles for K562 Rep.1; 16 cycles for K562 Rep. 2; 18 cycles for HepG2 Rep. 1; and 15 cycles for HepG2 Rep2); 72°C for 2 min). After PCR, we re-combined all sub-reactions into one and purified it with Agencourt Beads. We generated 100-bp paired-end reads for each biological replicate on an Illumina Hiseq4000 at the University of Chicago Genome Facility.

Input DNA library
We PCR-amplified a total of 200 ng of input plasmid library (in 16 replicate reactions) using Q5 Hot Start High-Fidelity DNA polymerase (NEB) with the following program: 98°C for 30 s; 4 cycles of 98°C for 10 s, 65°C for 30 s, and 72°C for 20 s; 8 cycles of 98°C for 10 s and 72°C for 50 s; 72°C for 2 min. After PCR, we combined all products into one and purified it with Agencourt Beads. We generated 100-bp paired-end reads on an Illumina Hiseq4000 at the University of Chicago Genome Facility.

Sequencing and preprocessing
For each of 160 replicates, paired-end sequencing reads were aligned to the human reference genome GRCh38 downloaded from the ENCODE portal (ENCSR425FOI) using BWA-mem (v0.7.17). Alignments were filtered against unmapped, secondary alignments, mapping quality score less than 30, and PCR duplicates using SAMtools (v1.9) and Picard (v2.9.0). All of the replicates were pooled and sorted for downstream analysis.

Fitting of distributions to the STARR-seq dataset
To build a statistical model that best describes the STARR-seq readout, we tested fitting of various univariate distributions to the output coverages of the STARR-seq dataset. To eliminate the influence of input coverage on output coverage, we subsampled bins with an input coverage value of 20 (approximately a median input coverage for both the HepG2 and K562 datasets) and used their output coverage values for the underlying observed distribution. We fit binomial, Poisson, and negative binomial distributions and estimated parameters using MLE. For binomial distributions, we assumed the number of the trial as the sum of the STARR-seq input and output coverage, and the probability of success as the sum of the STARR-seq output coverage divided by the total input and output coverage, as described previously [15]. The Poisson distribution is described by a single parameter λ, whereas the negative binomial distribution is described by both μ and the shape parameter θ. For example, we found λ = 41.6, μ = 41.6, and θ = 8.9 for the HepG2 dataset. Based on estimated parameters, we plotted the expected distribution quantiles from 0 to 100 percentiles against the actual observed quantiles. We plotted a straight diagonal line to show how well each distribution fit with the actual observed data.

Negative binomial distribution
A negative binomial distribution, which arises from Gamma-Poisson mixture, can be parameterized for y ≥ 0 as follows.
Substituting gives: Rearranging gives: Alternative parameterization of negative binomial regression using a rate model Alternative parameterization allows STARR-seq data to be modeled as a rate model. In contrast to using input coverage as one of the covariates, we can consider it as "exposure" to output coverage. This "trick" allows us to directly model the basal transcription rate (the ratio of RNA to DNA) as a rate response variable. We defined the transcription rate (RNA to DNA ratio) as a new variable, π i .
If we assume the majority of genomic bins will have the basal transcription rate, we can model the transcription rate at each ith bin following the traditional negative binomial (NB2) distribution.
The expected basal transcription, E(π i ), becomes the mean incidence rate of y i per unit of exposure, t i .
By normalizing μ i by t i , we are modeling a rate instead of a discrete count using the negative binomial distribution. The regression term for the expected transcription rate can be expressed in terms of a linear combination of explanatory variables, j covariates Rearranging in terms of the expected value of y, or μ, gives The natural log of t i on the RHS ensures μ i is normalized in the model, acting as an offset variable. In STARRPeaker software, we allow users to optionally choose this alternative rate model (implemented as "mode 2") instead of the default covariate model described in the main text. This alternate model is useful if constant basal transcription is expected throughout the genome or if covariates are available for directly modeling the basal transcription rate π.

Evaluation of potential orientation bias
For all enhancer peaks identified from both the HepG2 and K562 cell lines, we evaluated if there was an overrepresentation of fragments in a specific orientation. If there is no orientation bias, the STARR-seq active region should be equally represented by both forward-and reverse-stranded fragments. We performed a binomial test for the statistical significance on how unlikely it is to have a fragment distribution skewed on one strand.
where p is 0.5 (equal chance of being either forward or reverse stranded), n is number of all supporting fragments, and k is the number of forward-stranded fragments. We adjusted P values using FDR (BH) and used 0.01 as a cutoff. We ran binomial tests for both replicates. To call a region orientation-biased, we ensured that genomic DNA fragments were equally represented in both forward and reverse strands, and we checked for significant strand bias in both replicates. If strand bias is already present in a genomic DNA library, it is more likely that the bias will be due to amplification by PCR rather than being a result of orientation-specific activity.

MACS2
We used MACS2 version 2.1.1 [23] with the optimal parameters suggested for a human STARR-seq dataset (-f BAMPE -g hs). We also used an option to allow duplicates in read (--keep-dup all), since our STARR-seq dataset was multiplexed. We called peaks with an FDR cutoff of 0.05 (-q 0.05), as recommended by the author of the software.

Calculating folding free energy
We used the LinearFold [48] algorithm to estimate the folding energy of each genomic bin iteratively across the whole genome. Specifically, we used the Vienna RNAfold thermodynamic model [49] with parameters from Mathews et al. [50]. We implemented a parallel processing scheme to leverage multicore processors to expedite the calculation of folding free energy.

Recommended parameters for the model
We determined the model parameters from the experimental design of the STARR-seq assay. Our STARR-seq input library was based on DNA fragments that were sizeselected on an agarose gel for 500 bp. Therefore, we defined the bin size to be 500 bp with a step size of 100 bp. Furthermore, we chose the minimum and maximum template size to be 200 and 1000 bp, respectively, to recover RNA generated from the STARR-seq input library. We recommend setting the minimum template size no less than 200 bp, as a shorter template can become more prone to PCR bias during sequencing. Based on the sequencing depth, we determined the minimum coverage for the whole genome to be 10. To guide users on how to choose these parameters, we provide a sensitivity analysis of how changing the parameters affects the results (Additional file 6: Table S5). Overall, more than 80% of peaks were consistent regardless of varying the parameters. We found that changing the bin size affected the average peak size and changing the step size affected the resolution. However, we found that using a 50-bp step size approximately doubled the processing time.

Simulation of the STARR-seq dataset
We created an artificial STARR-seq dataset where the ground truth exists. We used the original STARR-seq input library as the base of the simulation. We focused on an approximately 2.  [43]. We selectively generated paired-end reads that only support control regions. Since we expect the background regions to have a basal level of transcription, we matched the read distribution of the input library to make the transcription rate equal to 1. All 28 control regions were then merged to create an artificial STARR-seq output library, and finally, the read coverage was visually inspected to ensure that it resembles one from the actual STARR-seq dataset.

Supplementary information
The online version contains supplementary material available at https://doi.org/10.1186/s13059-020-02194-x.  Figure S6. Correlation of RNA fragment coverage between replicates for (A) HepG2 or (B) K562 cell lines. Pairwise coverage was calculated using 1000 bp bins across the genome. The X-and Y-axis represent the natural log of fragment counts. A pseudo count of 1 was added to the fragment counts for plotting only. Bins with abnormally high fragment counts were removed to avoid inflation of the Pearson correlation. We used the median absolute deviation method with a scaling factor of 200 to filter extremely large deviations from the median. Supplementary Figure S7. Comparison of peaks called from subsamples of the original STARR-seq library, highlighting the robustness of STARRPeaker using (A) the whole dataset and (B) 5000 subset of peaks. Supplementary Figure S8. Orientation biases analysis for (A-B) HepG2 or (C-D) K562 cell lines. The ratio between forward and reverse stranded fragments was tested for statistical significance using a binomial test. Orange dots represent peaks with significant strand bias (FDR q-value < 0.01). Supplementary Figure S9. Comparison of peaks identified by various methods using a simulated STARR-seq dataset containing 28 spike-in control regions. Supplementary Figure S10. Enrichment of epigenetic signals around peaks in K562. All peaks were centered at the summit, uniformly thresholded using P value < 0.001, and 10,000 peaks were randomly selected. Ag-