Fig. 4 | Genome Biology

Fig. 4

From: DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies

Effects of bottlenecks on variant count distributions and fitness scores. a Input sample count distributions of previously published DMS experiments [20, 50]. For the FOS and FOS-JUN datasets, counts of single AA variants with one, two, or three nucleotide substitutions in the same codon are shown. For the tRNA dataset, all variants with one, two, or three nucleotide substitutions are shown. Wild-type counts are indicated by the black dashed line. Expected count frequencies purely due to sequencing errors are indicated by red and green dashed lines for single and double nucleotide substitution variants, respectively. Black arrows indicate sets of variants that have likely not been assayed but whose sequencing reads arise from sequencing errors. b Simulation of bottlenecks at various steps of the DMS workflow based on a previously published DMS dataset [6]. Scatterplots show input and output sample counts for variants with one or two nucleotide substitutions in the original data or after simulating 3% library, replicate, or DNA extraction bottlenecks (from left to right). Hexagon color indicates the number of nucleotide substitutions, and fill indicates the number of variants per 2D bin (see legend). Black arrows indicate sets of double nucleotide variants whose sequencing reads originate solely from sequencing errors. Dotted (or dashed) horizontal/vertical lines indicate soft (or hard) variant count thresholds used in downstream DiMSum analyses (see c). c Comparison of fitness scores from simulated datasets with (y-axis) or without (x-axis) the indicated bottlenecks. Variants are categorized by their robustness to filtering with hard thresholds (variants have to appear above the threshold in all replicates) or soft thresholds (variants have to appear above the threshold in at least one replicate) of 10 read counts. For the DNA extraction bottleneck, read count thresholds were also applied to output samples. Pearson correlation coefficients are indicated. The dashed line indicates the relationship y = x. Note that correlation coefficients are lower for soft than for hard thresholds, because a subset of variants has fewer replicate measurements.
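
The distinction between hard and soft thresholds in panel c can be illustrated with a minimal Python sketch. This is not DiMSum's implementation: the function name, the variant labels, and the per-replicate counts are hypothetical, and treating "above the threshold" as "greater than or equal to" is an assumption.

```python
from typing import Dict, List


def passes_threshold(counts: List[int], threshold: int = 10, mode: str = "hard") -> bool:
    """Return True if a variant's per-replicate read counts pass the filter.

    hard: the count must reach the threshold in ALL replicates.
    soft: the count must reach the threshold in AT LEAST ONE replicate.
    Assumption: a count equal to the threshold counts as passing.
    """
    above = [c >= threshold for c in counts]
    if mode == "hard":
        return all(above)
    if mode == "soft":
        return any(above)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical input read counts for three variants across three replicates.
input_counts: Dict[str, List[int]] = {
    "variant_A": [25, 40, 31],  # well covered in every replicate
    "variant_B": [12, 3, 0],    # lost in two replicates, e.g. after a bottleneck
    "variant_C": [2, 5, 1],     # poorly covered everywhere
}

hard_pass = [v for v, c in input_counts.items() if passes_threshold(c, 10, "hard")]
soft_pass = [v for v, c in input_counts.items() if passes_threshold(c, 10, "soft")]

print("hard:", hard_pass)
print("soft:", soft_pass)
```

In this toy example, variant_B passes only the soft threshold. Its fitness estimate would rest on a single well-covered replicate, which is consistent with the caption's note that soft-threshold variants show lower correlations because some of them have fewer replicate measurements.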
