Variant calling and sequence read handling
Enrich2 implements alignment-free variant calling. Variant sequences are expected to have the same length and start point as the user-supplied wild-type sequence, which allows Enrich2 to compare each variant to the wild-type sequence in a computationally efficient manner. In addition to this alignment-free mode, an implementation of the Needleman-Wunsch global alignment algorithm [42] is included that will call insertion and deletion events. Enrich2 supports overlapping paired-end reads and single-end reads for direct variant sequencing, as well as barcode sequencing for barcode-tagged variants.
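The alignment-free comparison can be illustrated with a minimal sketch: because variants share the wild-type's length and start point, differences can be found by direct position-wise comparison. The function below is a hypothetical illustration of the idea, not Enrich2's actual implementation, which also handles sequencing quality and codon-level changes.

```python
def call_variants_alignment_free(read, wt):
    """Call substitutions by position-wise comparison to wild-type.

    Assumes the read has the same length and start point as the
    wild-type sequence (no insertions or deletions), which is what
    makes the alignment-free mode computationally efficient.
    Returns a list of (position, wild-type base, variant base) tuples.
    """
    if len(read) != len(wt):
        raise ValueError("alignment-free mode requires equal-length sequences")
    return [(i, wt[i], read[i]) for i in range(len(wt)) if read[i] != wt[i]]
```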
Calculating enrichment scores
For selections with at least three time points, we define T, which includes all time points, and T′, which includes all time points except the input (\( t_0 \)). The frequency of a variant (or barcode) v in time point t is the count of the variant in the time point (\( c_{v,t} \)) divided by the number of reads sequenced in the time point (\( N_t \)).
$$ {f}_{v,t}=\frac{c_{v,t}}{N_t} $$
The change in frequency for a variant v in a non-input time point t ∊ T′ is the ratio of frequencies for t and the input.
$$ {r}_{v,t}=\frac{f_{v,t}}{f_{v,0}} $$
Instead of using this raw change in variant frequency, we divide each variant’s ratio by the wild-type (wt) variant’s ratio.
$$ \frac{r_{v,t}}{r_{wt,t}}=\frac{c_{v,t}{c}_{wt,0}}{c_{v,0}{c}_{wt,t}} $$
Because the library size terms (\( N_t \) and \( N_0 \)) in the frequencies cancel out, the ratio of ratios does not depend on other non-wild-type variants in the selection. In practice, we add \( \frac{1}{2} \) to each count to assist with very small counts [43] and take the natural log of this ratio of ratios.
$$ {L}_{v,t}= \log \left(\frac{\left({c}_{v,t}+\frac{1}{2}\right)\left({c}_{wt,0}+\frac{1}{2}\right)}{\left({c}_{v,0}+\frac{1}{2}\right)\left({c}_{wt,t}+\frac{1}{2}\right)}\right) $$
This equation can be rewritten as
$$ {L}_{v,t}= \log \left(\frac{c_{v,t}+\frac{1}{2}}{c_{wt,t}+\frac{1}{2}}\right)- \log \left(\frac{c_{v,0}+\frac{1}{2}}{c_{wt,0}+\frac{1}{2}}\right) $$
If we were to regress \( L_{v,t} \) on t ∊ T′, we note that the second term is shared between all the time points and therefore only affects the intercept of the regression line. We do not use the intercept in the score, so instead we regress on \( M_{v,t} \) and use all values of t ∊ T.
$$ {M}_{v,t}= \log \left(\frac{c_{v,t}+\frac{1}{2}}{c_{wt,t}+\frac{1}{2}}\right) $$
The score is defined as the slope of the regression line, \( {\widehat{\beta}}_v \). In practice, we regress on \( \frac{t}{ \max T} \) to facilitate comparisons between selections with different magnitudes of time points (e.g. 0/1/2/3 rounds versus 0/24/48/72 hours).
To account for unequal information content across time points with variable sequencing coverage, we perform weighted linear least squares regression [44]. The regression weight for \( M_{v,t} \) is \( V_{v,t}^{-1} \), where \( V_{v,t} \) is the variance of \( M_{v,t} \) based on Poisson assumptions [43] and is approximately
$$ {V}_{v,t}=\frac{1}{c_{v,t}+\frac{1}{2}}+\frac{1}{c_{wt,t}+\frac{1}{2}} $$
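The regression-based score above can be sketched end to end: compute \( M_{v,t} \) with pseudocounts, weight each point by the inverse of its approximate variance, and take the weighted least squares slope over the normalized time points. The function name and interface below are illustrative only and do not correspond to Enrich2's internal API.

```python
import numpy as np

def regression_score(variant_counts, wt_counts):
    """Weighted least squares slope for one variant across time points.

    variant_counts, wt_counts: raw counts c_{v,t} and c_{wt,t} ordered by
    time point (input first). Time points are assumed here to be
    0, 1, ..., len-1 before normalization by max T.
    """
    cv = np.asarray(variant_counts, dtype=float) + 0.5
    cwt = np.asarray(wt_counts, dtype=float) + 0.5
    M = np.log(cv / cwt)                       # M_{v,t}
    V = 1.0 / cv + 1.0 / cwt                   # Poisson-based variance of M_{v,t}
    w = 1.0 / V                                # regression weights
    t = np.arange(len(M)) / float(len(M) - 1)  # t / max T
    # weighted least squares slope about the weighted means
    tbar = np.average(t, weights=w)
    Mbar = np.average(M, weights=w)
    beta = np.sum(w * (t - tbar) * (M - Mbar)) / np.sum(w * (t - tbar) ** 2)
    return beta
```

For a variant whose counts double relative to wild-type at each time point, the slope recovers approximately log(4) ≈ 1.386 after normalizing time to the unit interval.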
For selections with only two time points (e.g. input and selected), we use the slope of the line connecting the two points as the score. This is equivalent to the wild-type-adjusted log ratio (\( L_v \)), derived similarly to \( L_{v,t} \) above.
$$ {L}_v= \log \left(\frac{c_{v,sel}+\frac{1}{2}}{c_{wt,sel}+\frac{1}{2}}\right)- \log \left(\frac{c_{v,inp}+\frac{1}{2}}{c_{wt,inp}+\frac{1}{2}}\right) $$
As there is no residual error about the fitted line, we must use a different method to estimate the standard error. We calculate a standard error (\( SE_v \)) for the enrichment score \( L_v \) under Poisson assumptions [24, 43].
$$ S{E}_v=\sqrt{\frac{1}{c_{\mathit{v,inp}}+\frac{1}{2}}+\frac{1}{c_{\mathit{wt,inp}}+\frac{1}{2}}+\frac{1}{c_{\mathit{v,sel}}+\frac{1}{2}}+\frac{1}{c_{\mathit{wt,sel}}+\frac{1}{2}}} $$
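The two-time-point score and its standard error reduce to a few lines; the sketch below is a direct transcription of the two formulas above (the function name is illustrative, not Enrich2's API).

```python
import math

def two_point_score(c_v_inp, c_wt_inp, c_v_sel, c_wt_sel):
    """Enrichment score L_v and Poisson-based standard error SE_v
    for a single variant in a two-time-point (input/selected) selection."""
    # add the 1/2 pseudocount to every raw count
    vi, wi = c_v_inp + 0.5, c_wt_inp + 0.5
    vs, ws = c_v_sel + 0.5, c_wt_sel + 0.5
    score = math.log(vs / ws) - math.log(vi / wi)
    se = math.sqrt(1 / vi + 1 / wi + 1 / vs + 1 / ws)
    return score, se
```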
For experiments with no wild-type sequence, scores can be calculated using the filtered library size for each time point t, which is defined as the sum of counts at time t for variants that are present in all time points.
Combining replicate scores
To account for replicate heterogeneity, we use a simple meta-analysis model with a single random effect to combine scores from each of the n replicate selections into a single score for each variant. Each variant’s score is calculated independently. Enrich2 computes the restricted maximum likelihood estimates for the variant score (\( \widehat{\beta} \)) and standard error (\( {\widehat{\sigma}}_s \)) using Fisher scoring iterations [45]. Given the replicate scores (\( {\widehat{\beta}}_i \)) and estimated standard errors (\( {\widehat{\sigma}}_i \)) where i = 1, 2, …, n, the estimate for \( \widehat{\beta} \) at each iteration is the weighted average:
$$ \widehat{\beta}=\frac{{\displaystyle {\sum}_{i=1}^n}{\widehat{\beta}}_i{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-1}}{{\displaystyle {\sum}_{i=1}^n}{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-1}} $$
The starting value for \( {\widehat{\sigma}}_s^2 \) at the first iteration is:
$$ {\widehat{\sigma}}_s^2=\frac{1}{n-1}{\displaystyle \sum_{i=1}^n}{\left({\widehat{\beta}}_i-\overline{\widehat{\beta}}\right)}^2 $$
Enrich2 calculates the following fixed-point solution for \( {\widehat{\sigma}}_{s+1}^2 \):
$$ {\widehat{\sigma}}_{s+1}^2={\widehat{\sigma}}_s^2\frac{{\displaystyle {\sum}_{i=1}^n}{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-2}{\left({\widehat{\beta}}_i-\widehat{\beta}\right)}^2}{{\displaystyle {\sum}_{i=1}^n}{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-1}-\frac{{\displaystyle {\sum}_{i=1}^n}{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-2}}{{\displaystyle {\sum}_{i=1}^n}{\left({\widehat{\sigma}}_s^2+{\widehat{\sigma}}_i^2\right)}^{-1}}} $$
Because it is more computationally efficient to perform a fixed number of iterations for all variant scores in parallel than to test for convergence of each variant, Enrich2 performs 50 Fisher scoring iterations. In practice, this is more than sufficient for \( {\widehat{\sigma}}_s^2 \) to converge. We record the difference \( {\varepsilon}_s={\widehat{\sigma}}_s^2-{\widehat{\sigma}}_{s-1}^2 \) for the final iteration and flag any variants with high values of \( \varepsilon_s \) as having failed to converge. No such variants were encountered in the analyses detailed here.
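The random-effects combination can be sketched for a single variant as follows. This mirrors the starting value, weighted average, and fixed-point update given above; Enrich2 itself vectorizes these iterations across all variants at once, and the function name here is illustrative only.

```python
import numpy as np

def combine_replicates(betas, sigmas, iterations=50):
    """Combine replicate scores with the single random-effect model.

    betas: replicate scores beta_i; sigmas: their standard errors sigma_i.
    Returns the combined score and the random-effect variance sigma_s^2.
    """
    b = np.asarray(betas, dtype=float)
    var_i = np.asarray(sigmas, dtype=float) ** 2
    # starting value: sample variance of the replicate scores
    sigma2 = np.sum((b - b.mean()) ** 2) / (len(b) - 1)
    for _ in range(iterations):
        w = 1.0 / (sigma2 + var_i)            # inverse-variance weights
        beta_hat = np.sum(w * b) / np.sum(w)  # weighted average of replicates
        # fixed-point update for the random-effect variance
        num = np.sum(w ** 2 * (b - beta_hat) ** 2)
        den = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        sigma2 = sigma2 * num / den
    w = 1.0 / (sigma2 + var_i)
    beta_hat = np.sum(w * b) / np.sum(w)
    return beta_hat, sigma2
```

With equal replicate standard errors the combined score reduces to the simple mean, and the fixed point for \( {\widehat{\sigma}}_s^2 \) can be checked analytically.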
For the fixed-effect model [29], we calculate the variant score (\( {\widehat{\beta}}^{\prime} \)) and standard error (\( {\widehat{\sigma}}_s^{\prime} \)) using a weighted average of the replicate scores (\( {\widehat{\beta}}_i \)), where the weight for each score is the inverse of that replicate's variance (\( {\widehat{\sigma}}_i^2 \)). The standard error of the variant \( {\widehat{\sigma}}_s^{\prime} \) is:
$$ {\widehat{\sigma}}_s^{\hbox{'}}=\sqrt{\frac{1}{{\displaystyle {\sum}_{i=1}^n}{\widehat{\sigma}}_i^{-2}}} $$
The fixed-effect model was used for comparison purposes only and is not implemented in the Enrich2 software.
Derivation of predicted scores
The behavior of a variant v in a simulated binding experiment (e.g. phage display, yeast display) can be described in terms of the displaying entity’s likelihood of being selected in a given round [39, 46]. This likelihood is related to the binding affinity of each variant, and, by extension, the binding probability of an individual protein molecule under the experimental conditions. The relationship between variant binding affinity, monomer binding probability, and likelihood of selection will depend on the specifics of the experiment such as the number of molecules displayed per cell or phage, ligand concentration, and non-specific binding [39]. Each round of selection is a time point t in the analysis, so we can assign each variant a probability of being selected in a given time point (\( p_{v,t} \)). We assume that \( p_{v,t} = p_{v,0} = p_v \) (i.e. that the probability is constant throughout the selection) and that any grow-out or amplification steps are uniform across all variants.
The initial variant population is determined by the variant population frequencies (\( {f}_{v,0}^{\mathit{\prime}} \)) and the size of the starting population (\( {N}_0^{\prime } \)).
$$ {c}_{v,\ 0}^{\prime }={f}_{v,\ 0}^{\prime }{N}_0^{\prime } $$
We note that \( {c}_{v,t}^{\prime} \), \( {f}_{v,t}^{\prime} \), and \( {N}_{t}^{\prime} \) refer to the variant population itself, in contrast to the previously defined \( c_{v,t} \), \( f_{v,t} \), and \( N_t \), which refer to sequence reads derived from the variant population.
We define \( a_t \) as a factor describing growth between round t and the previous round (\( a_0 = 1 \)). We assume that \( a_t \) is the same for all variants. The count for a variant in time point t + 1 in terms of the count in time point t is:
$$ {c}_{v,\ t+1}^{\prime }={a}_{t+1}{p}_v{c}_{v,\ t}^{\prime } $$
Therefore, the count for a variant in time point t given the starting count is:
$$ {c}_{v,t}^{\hbox{'}}={c}_{v,0}^{\hbox{'}}{\displaystyle \prod_{j=1}^t}{a}_j{p}_v={f}_{v,0}^{\hbox{'}}{N}_0^{\hbox{'}}{p}_v^t{\displaystyle \prod_{j=1}^t}{a}_j $$
We can write the ratio of variant counts in these terms and define the log ratio for binding experiments (\( M_{v,t}^{\prime} \)).
$$ \frac{c_{v,t}^{\hbox{'}}}{c_{wt,t}^{\hbox{'}}}=\frac{f_{v,0}^{\hbox{'}}{N}_0^{\hbox{'}}\kern0.1em {p}_v^t{\displaystyle {\prod}_{j=1}^t}{a}_j}{f_{wt,0}^{\hbox{'}}{N}_0^{\hbox{'}}\kern0.5em {p}_{wt}^t{\displaystyle {\prod}_{j=1}^t}{a}_j}=\frac{f_{v,0\kern0.1em }^{\hbox{'}}{p}_v^t}{f_{wt,0\kern0.1em }^{\hbox{'}}{p}_{wt}^t} $$
$$ {M}_{v,t}^{\hbox{'}}= \log \left(\frac{c_{v,t}^{\hbox{'}}}{c_{wt,t}^{\hbox{'}}}\right)=t\cdot \log \left(\frac{p_v}{p_{wt}}\right)+ \log \left(\frac{f_{v,0}^{\hbox{'}}}{f_{wt,0}^{\hbox{'}}}\right) $$
If we substitute t for \( {t}^{\prime }=\frac{t}{ \max T} \), we find that the expected score for binding experiments under the regression scoring model (\( {\beta}_v^{\prime } \)) should be related to the variant selection probability (\( p_v \)) by:
$$ {\beta}_v^{\prime }=\left( \max \kern0.3em T\right) \log \left(\frac{p_v}{p_{wt}}\right) $$
The behavior of a variant v in a simulated growth experiment can be described by the growth rate at time t (\( \mu_v(t) \)). Unlike in the round-based binding experiment case, time in growth experiments is modeled as continuous. We assume that \( \mu_v(t) = \mu_v(0) = \mu_v \) (i.e. that the growth rate is constant throughout the selection) and that any amplification steps are uniform across all variants. This derivation is based on [16, 18]. In interference-free growth, the growth of individual variants can be described by the first-order equation:
$$ \frac{d{c}_v^{\hbox{'}}}{dt}={\mu}_v{c}_v^{\hbox{'}}(t) $$
Therefore, the count for a variant at time t given the starting count is:
$$ {c}_v^{\hbox{'}}(t)={c}_{v,0}^{\hbox{'}}{e}^{\mu_vt}={f}_{v,0}^{\hbox{'}}{N}_0^{\hbox{'}}{e}^{\mu_vt} $$
We can write the ratio of variant counts in these terms and construct the continuous function \( M_v^{\prime\prime}(t) \).
$$ \frac{c_v^{\hbox{'}}(t)}{c_{wt}^{\hbox{'}}(t)} = \frac{N_0^{\hbox{'}}\kern0.2em {f}_{v,0}^{\hbox{'}}{e}^{\mu_vt}}{N_0^{\hbox{'}}\kern0.2em {f}_{wt,0}^{\hbox{'}}{e}^{\mu_{wt}t}} = \frac{f_{v,0}^{\hbox{'}}}{f_{wt,0}^{\hbox{'}}}{e}^{\left({\mu}_v-{\mu}_{wt}\right)t} $$
$$ {M}_v^{{\prime\prime} }(t)= \log \left(\frac{c_v^{\hbox{'}}(t)}{c_{wt}^{\hbox{'}}(t)}\right)=\left({\mu}_v-{\mu}_{wt}\right)t+ \log \left(\frac{f_{v,0}^{\hbox{'}}}{f_{wt,0}^{\hbox{'}}}\right) $$
We convert to the discrete function \( M_{v,t}^{\prime\prime} \) for convenience by assuming that m time points are sampled at constant intervals, determined by the number of wild-type doublings (δ) per time point, such that max T = mδ. We then find that the expected score for growth experiments under the regression scoring model (\( \beta_v^{\prime\prime} \)) should be related to the growth rate (\( \mu_v \)) by:
$$ {\beta}_v^{{\prime\prime} }=m\delta \left({\mu}_v-{\mu}_{wt}\right) $$
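The two predicted-score formulas transcribe directly into code; the helpers below are illustrative only (not part of Enrich2 or simdms) and simply restate the equations above for use as ground truth when evaluating simulated scores.

```python
import math

def predicted_binding_score(p_v, p_wt, max_T):
    """Expected regression slope beta'_v for a binding selection."""
    return max_T * math.log(p_v / p_wt)

def predicted_growth_score(mu_v, mu_wt, m, delta):
    """Expected regression slope beta''_v for a growth selection,
    with m time points of delta wild-type doublings each."""
    return m * delta * (mu_v - mu_wt)
```

A wild-type-like variant has a predicted score of zero under both models, as expected from the wild-type normalization.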
Generation of simulated datasets
Simulated datasets contain 10,000 unique variants (including wild-type), each characterized by a true variant effect: the probability of selection in each round (\( p_v \)) for binding simulations or the growth rate (\( \mu_v \)) for growth simulations. We assume that the variant effect distribution is normal and set the wild-type effect to \( p_{wt} = 0.5 \) and \( \mu_{wt} = 1 \). We set the wild-type effect at the 75th percentile of the distribution and set the standard deviation to 0.1. We draw 9999 variants from this distribution, with \( 0.05 < p_v < 0.99 \) and \( 0.05 < \mu_v < 5 \).
In each case, the population size is 10 million, with a starting wild-type frequency of 1%. Starting counts for each variant are simulated using a log-normal distribution of variant counts in the input time point such that the mean variant input count is 990 and the standard deviation of the distribution is 0.4 [16, 47]. Starting counts are independently generated for each replicate.
For each replicate, the starting population undergoes five rounds of selection. The count of each variant after binding (\( k_{v,t} \)) is generated using a binomial distribution with parameters \( n = c_{v,t-1}^{\prime} \) and \( p = p_v \). The count of each variant after growth (\( g_{v,t} \)) is generated using a negative binomial distribution with parameters \( r = c_{v,t-1}^{\prime} \) and \( p={e}^{-{\mu}_v\Delta t} \), where \( \Delta t=\frac{\delta \ln 2}{\mu_{wt}} \). For these simulations, δ = 2. The population count for each variant (\( c_{v,t}^{\prime} \)) is obtained by performing weighted random sampling with replacement, where the weight for each variant is proportional to \( k_{v,t} \) or \( g_{v,t} \) and the total population size is 10 million.
Read counts for each variant (\( c_{v,t} \)) are simulated by performing weighted random sampling with replacement, where the weight for each variant is proportional to the population counts (\( c_{v,t}^{\prime} \)) and the average sequencing depth is 200 reads per variant (approximately 2 million reads per time point).
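One round of the binding simulation and the read-sampling step might be sketched as below, using small populations for illustration; the actual simulations in simdms use the parameters described above (a population of 10 million, 200 reads per variant), and these function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def binding_round(pop_counts, p_select, pop_size=10_000_000):
    """One simulated round: binomial survival per variant, then weighted
    random sampling with replacement (a multinomial draw) back to a
    fixed population size, modeling grow-out/amplification."""
    survivors = rng.binomial(pop_counts, p_select)
    probs = survivors / survivors.sum()
    return rng.multinomial(pop_size, probs)

def sequence(pop_counts, depth_per_variant=200):
    """Simulate read counts by sampling reads with replacement,
    weighted by the population count of each variant."""
    n_reads = depth_per_variant * len(pop_counts)
    probs = pop_counts / pop_counts.sum()
    return rng.multinomial(n_reads, probs)
```

Weighted random sampling with replacement at fixed total size is exactly a multinomial draw, which is why `rng.multinomial` is used for both the population bottleneck and the sequencing step.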
We simulate replicate noise by drawing a new variant effect from the variant effect distribution for 10% of randomly chosen variants (not including wild-type). The new variant effect is used to simulate one replicate, and the other four replicates use the original effect. Noisy effects are split uniformly between the five replicates, such that 2% of the variants in each replicate are affected.
We simulate time point amplification and depletion noise by multiplying or dividing \( c_{v,t}^{\prime} \) by 50 before performing the sampling step to obtain \( c_{v,t} \). We randomly choose 10% of variants to be affected by noise, 5% subject to amplification and 5% subject to depletion, split uniformly among the five replicates. For each noisy variant in the chosen replicate, one time point (including input) is randomly chosen for amplification or depletion.
Python code for generating these simulated datasets is available as simdms v0.1 (DOI: 10.5281/zenodo.546311).
Deep mutational scan of Phospholipase A2
A region proximal to both lipid binding sites of the C2 domain of Phospholipase A2 (PLA2) was targeted for deep mutational scanning. Positions 94–97 of the C2 domain of mouse PLA2-alpha (ANYV) were fully randomized using a doped synthetic oligonucleotide. The library of C2 subdomains containing mutations was cloned into the AvrII and PpuMI sites of the wild-type C2 domain in pGEM. The library was subcloned into phage arms and expressed on the surface of bacteriophage using the T7 phage display system according to the manufacturer’s instructions (Novagen T7Select 10-3b). The library was amplified in BLT5403 E. coli and variants were selected for their ability to bind to a lipid mixture containing ceramide 1-phosphate (C1P) [48]. The mouse PLA2-alpha cDNA was a generous gift from Michael Gelb, University of Washington. Ni Sepharose Excel (capacity 10 mg/mL) was purchased from GE. Other reagents were purchased from Thermo Fisher.
To select for C1P binding, lipid nanodiscs were developed as a bilayer affinity matrix. The His6-tagged membrane scaffold protein MSP1D1 [49] was expressed in BL21 E. coli from a pET28a plasmid and purified on nickel resin, then used to generate lipid nanodiscs comprising 30 mol% phosphatidylcholine, 20 mol% phosphatidylserine, 40 mol% phosphatidylethanolamine, and 10 mol% C1P [50]. To separate nanodiscs from large lipid aggregates and free protein, the mixture was subjected to gel filtration using a Superose 6 10/300 GL column (Pharmacia) and the major peak following the void volume was collected. To generate the affinity resin, 70 μg of nanodiscs (quantified by protein content) was incubated overnight at 4 °C with 10 μL nickel resin in 20 mM Tris pH 7.5 and 100 mM NaCl. The resin was washed twice in the same solution and used in phage binding reactions.
Phage expressing the C2 domain variant library were titered and diluted to a concentration of 5 × 109 pfu/mL in 20 mM Tris pH 7.5 and 100 mM NaCl, then incubated with lipid nanodisc affinity resin plus 10 μM calcium in a final volume of 350 μL. After a 2-hour incubation at 4 °C, the resin was washed four times in 1 mL of the incubation buffer containing 20 mM imidazole. Phage bound to nanodiscs were eluted with 20 mM Tris pH 7.5 containing 500 mM imidazole. Phage from the elution were titered, amplified, and subjected to additional rounds of selection. Three replicate selections were performed on different days using the same input phage library.
Sequencing libraries were prepared by PCR amplifying the variable region using primers that append Illumina cluster generating and index sequences (Additional file 7) before sequencing using the Illumina NextSeq platform with a NextSeq high output kit (75 cycles, FC-404-1005). Reads were demultiplexed using bcl2fastq v2.17 (Illumina) with the arguments bcl2fastq --with-failed-reads --create-fastq-for-index-reads --no-lane-splitting --minimum-trimmed-read-length 0 --mask-short-adapter-reads 0. Quality was assessed using FastQC v0.11.3 [51]. Demultiplexed reads are available in the NCBI Sequence Read Archive, BioProject Accession PRJNA344387.
Neuraminidase data analysis
Raw reads were demultiplexed using a custom script based on three-nucleotide barcodes provided by the original authors [30]. The reads were analyzed in Enrich2 v1.0.0 as ten experimental conditions: five non-overlapping 30-base regions of the neuraminidase gene in either the presence or absence of oseltamivir. Reads were required to have a minimum quality score of 23 at all positions and contain no Ns. The five mutagenized regions were scored independently and then merged to create a single set of variant scores for each treatment. To be consistent with the original study, we removed variants containing multiple nucleotide changes with the exception of p.Ile117Ter and p.Thr226Trp that were individually validated. The p values for comparing variant scores to wild-type in each treatment and comparing variant scores between treatments were calculated using a z-test. All three sets of p values were jointly corrected for multiple testing using the qvalue package in R [52], and variants with a q value of less than 0.05 were reported as significant.
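The z-test used for these comparisons can be sketched as follows, treating two scores as independent normal estimates with known standard errors. This helper is hypothetical; the actual analysis additionally applies the q-value multiple-testing correction described above.

```python
import math

def z_test(score1, se1, score2, se2):
    """Two-sided z-test for a difference between two enrichment scores.
    To compare a variant against wild-type, pass 0 for score2 and
    the wild-type standard error for se2."""
    z = (score1 - score2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # two-sided p value from the standard normal tail: 2 * (1 - Phi(|z|))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```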
Analysis of other datasets
For previously published datasets, raw sequence files in FASTQ format were obtained from the respective authors. Datasets (Table 1) were analyzed independently using Enrich2 v1.0.0. The BRCA1 dataset was analyzed in a single run with separate experimental conditions for the yeast two-hybrid and phage display assays. For all datasets except neuraminidase, reads were required to have a minimum quality score of 20 at all positions and contain no Ns.
For the WW domain sequence function map (Fig. 1), scores and standard errors were calculated using weighted least squares linear regression in two technical replicates and the replicates were combined using the random-effects model as described.
Enrich2 software implementation
Enrich2 is implemented in Python 2.7 and depends on common scientific Python packages. The graphical user interface is implemented using Tkinter. A deep mutational scanning experiment is represented as a tree of objects with four levels: experiment, condition, selection, and sequencing library. Each object’s data and metadata are stored in a single HDF5 file, including intermediate values calculated during analysis.
Enrich2 is designed to be run locally on a laptop computer and does not require a high-performance computing environment. Most analyses can be run overnight (Table 1). Run times in Table 1 were measured using a MacBook Pro Retina with a 2.8 GHz Intel Core i7 processor and 16 GB of RAM.