Depletion of Abundant Sequences by Hybridization (DASH): using Cas9 to remove unwanted high-abundance species in sequencing libraries and molecular counting applications

Next-generation sequencing has generated a need for a broadly applicable method to remove unwanted high-abundance species prior to sequencing. We introduce DASH (Depletion of Abundant Sequences by Hybridization). Sequencing libraries are ‘DASHed’ with recombinant Cas9 protein complexed with a library of guide RNAs targeting unwanted species for cleavage, thus preventing them from consuming sequencing space. We demonstrate a more than 99 % reduction of mitochondrial rRNA in HeLa cells, and enrichment of pathogen sequences in patient samples. We also demonstrate an application of DASH in cancer. This simple method can be adapted for any sample type and increases sequencing yield without additional cost. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0904-5) contains supplementary material, which is available to authorized users.


Background
The challenge of extracting faint signals from abundant noise in molecular diagnostics is a recurring theme across a broad range of applications. In the case of RNA sequencing (RNA-Seq) experiments specifically, there may be several orders of magnitude difference between the most abundant species and the least. This is especially true for metagenomic analyses of clinical samples like cerebrospinal fluid (CSF), whose source material is inherently limited [1], making enrichment or depletion strategies impractical or impossible to employ prior to library construction. The presence of unwanted highabundance species, such as transcripts for the 12S and 16S mitochondrial ribosomal RNAs (rRNAs), effectively increases the cost and decreases the sensitivity of counting-based methodologies.
The same issue affects other molecular clinical diagnostics. In cancer profiling, the fraction of the mutant tumor-derived species may be vastly outnumbered by wild-type species due to the abundance of immune cells or the interspersed nature of some tumors throughout normal tissue. This problem is profoundly exaggerated in the case of cell-free DNA/RNA diagnostics, whether from malignant [2,3], transplant [4], or fetal sources [5,6], and relies on brute force counting by either sequencing or digital PCR (dPCR) [7] to yield a detectable signal. For these applications, a technique to deplete specific unwanted sequences that is independent of sample preparation protocols and agnostic to measurement technology is highly desired.
CRISPR (clustered regularly interspaced short palindromic repeats) and Cas (CRISPR associated) nucleases, such as Cas9, function in bacterial adaptive immune systems to remove incoming phage DNA from the host without harm to the bacteria's own genome. The CRISPR-Cas9 system has attained widespread adoption as a genome editing technique [8][9][10][11]. When coupled with single guide RNAs (sgRNAs) designed against targets of interest, Streptococcus pyogenes Cas9 binds to 3' NGG protospacer adjacent motif (PAM) sites and produces double-stranded breaks if the sgRNA successfully hybridizes with the adjacent target sequence (Fig. 1a). In vitro, Cas9 may be used to cut DNA directly, in a manner analogous to a conventional restriction enzyme, except that the target sequence (outside of the PAM site) may be programmed at will and massively multiplexed without significant off-target effects. This affords the unique opportunity to target and prevent amplification of undesired sequences, such as those that are generated during next-generation sequencing (NGS) protocols.
In this paper, we have exploited the unique properties of Cas9 to selectively deplete unwanted high-abundance sequences from existing RNA-Seq libraries. We refer to this approach as Depletion of Abundant Sequences by Hybridization (DASH). Employing DASH after transposon-mediated fragmentation but prior to the following amplification step (which relies on the presence of adaptor sequences on both ends of the fragment) prevents amplification of the targeted sequences, thus ensuring they are not represented in the final sequencing library (Fig. 1b). We show that this technique preserves the representational integrity of the non-targeted sequences while increasing overall sensitivity in cell line samples and human metagenomic patient samples. Further, we demonstrate the utility of this system in the context of cancer detection, in which depletion of wild-type sequences increases the detection limit for oncogenic mutant sequences. The DASH technique may be used to deplete specific unwanted sequences from existing Illumina sequencing libraries, PCR amplicon libraries, plasmid collections, phage libraries, and virtually any other existing collection of DNA species.
Existing specific sequence enrichment techniquessuch as pull-down methods [5,[12][13][14], amplicon-based methods [3,15], molecular inversion methods [16][17][18], COLD-PCR [19], competitive allele-specific TaqMan PCR (castPCR) [20], and the classic method of using restriction enzyme digestion on mutant sites [21] can effectively enrich for targets in sequencing libraries, but these are not useful for discovery of unknown or unpredicted sequences. Brute force counting methods also exist, such as dPCR [3,7], but they are not easy to multiplex across a large panel. While high-throughput sequencing of select regions can be highly multiplexed to detect rare and novel mutations, and barcoded unique identifiers can overcome sequencing error noise [22], it is costly since the vast majority of the sequencing reads map to non-informative wild-type sequences. A number of sequence-specific RNA depletion methods also currently exist. Illumina's Ribo-Zero rRNA Removal Kit and Ambion's GLOBINclear Kit pull rRNAs and globin mRNAs, respectively, out of total RNA samples using sequence-specific oligos conjugated to magnetic beads. RNAse H-based methods, such as New England BioLab's NEBNext rRNA Depletion Kit similarly mark abundant RNA species with sequence-specific DNA oligos, and then subject them to degradation by RNAse H, which digests RNA/DNA hybrid molecules [23]. These methods are all employed prior to the start of library prep, and are limited to samples containing at least 10 ng to 1 μg of RNA. DASH, in contrast, depletes abundant species after complementary DNA (cDNA) amplification, and thus can be utilized for essentially any amount of input sample.

Results
We demonstrate deletion of unwanted mitochondrial rRNA using DASH first on HeLa cell line RNA (Fig. 2) and then on CSF RNA from patients with pathogens in their CSF (Fig. 3), in order to increase sequencing bandwidth of useful data. Selection of rRNA sgRNA targets was based on examining coverage plots for standard RNA-Seq experiments on HeLa cells as well as on several patient CSF samples. Coverage of the 12S and 16S mitochondrial rRNA genes was consistently several orders of magnitude higher than the rest of the mitochondrial and non-mitochondrial genes (Figs. 2c and 3). We chose 54 sgRNA target sites within this high-coverage region of the mitochondrial chromosome, situated approximately every 50 bp over a 2.5 kb region (sequences listed in Additional file 1). sgRNA sites are indicated by red arrowheads in Fig. 2b. sgRNAs for these sites were generated as described in the "Materials and methods" section.
To calculate the input ratio of Cas9 and sgRNA to sample nucleic acid, we estimated that 90 % of each sample was comprised of the rRNA regions that we targeted; thus, our potential substrate makes up 4.5 ng of a 5 ng sample. This corresponds to a target site concentration of 13.8 nM in the 10 μL reaction volume. To assure the most thorough Cas9 activity possible, and given that Cas9 is a single-turnover enzyme in vitro [24], we used a 100-fold excess of Cas9 protein and a 1000-fold excess of sgRNA relative to the target. Thus, each 10 μL sample of cDNA generated from a CSF sample contained a final concentration of 1.38 μM Cas9 protein and 13.8 μM sgRNA. In the case of HeLa cDNA, we used only 1 ng per sample, and therefore decreased the Cas9 and sgRNA concentrations by fivefold. However, since mitochondrial rRNA sequences represented only approximately 60 % of the HeLa samples (compared with approximately 90 % for CSF), the HeLa samples contained 150-fold Cas9 and 1500-fold sgRNA. To examine dose response, we processed additional 1 ng HeLa samples treated with 15-fold Cas9 and 150-fold sgRNA. Both concentrations were done in triplicate (Additional file 2).

Reduction of unwanted abundant sequences in HeLa samples
We first demonstrate the utility and efficacy of our approach using sequencing libraries prepared from total RNA extracted from HeLa cells. In the untreated samples, reads mapping to 12S and 16S mitochondrial rRNA genes represent 61 % of all uniquely mapped human reads. After DASH treatment, these sequences are reduced to only 0.055 % of those reads (Fig. 2a, b). Comparison of gene-specific fragments per kilobase of transcript per million mapped reads (fpkm) values between treated and untreated samples reveals mean 82fold and 105-fold decreases in fpkm values for 12S and 16S rRNA, respectively, in the samples treated with 150fold Cas9 and 1500-fold sgRNA (Fig. 2c). Similarly, the samples treated with 15-fold Cas9 and 150-fold sgRNA show 30-and 45-fold reductions in 12S and 16S fpkm values, respectively, indicating a dose-dependent response to DASH treatment (Additional file 2).

Enrichment of non-targeted sequences and analysis of off-target effects in HeLa samples
This profound depletion of abundant 12S and 16S transcripts increases the available sequencing capacity for the remaining, untargeted transcripts. We quantify this increase by the slope of the regression line fit to the remaining genes, showing a 2.38-fold enrichment in fpkm values for all untreated transcripts. An R 2 coefficient of 0.979 for this regression line indicates strong consistency between replicates with minimal off-target effects (Fig. 2c).
To confirm that our depletion was specific to only the targeted mitochondrial sequences, we calculated the changes in fpkm values across all genes in the treated and untreated samples and identified those genes that were significantly diminished (>2 standard deviations) relative to their control values. To overcome issues with stochastic variation at low gene counts/fpkm, we eliminated those genes that, between the three technical replicates at each Cas9 concentration, showed standard deviations in fpkm values greater than 50 % of the mean. All of the genes meeting this criterion were present at  Additional sequence specificity is conferred by a single guide RNA (sgRNA) with a 20 nucleotide hybridization domain. DNA double strand cleavage occurs three nucleotides upstream of the PAM site. b Depletion of Abundant Sequences by Hybridization (DASH) is used to target regions that are present at a disproportionately high copy number in a given next-generation sequencing library following tagmentation or flanking sequencing adaptor placement. Only non-targeted regions that have intact adaptors on both ends of the same molecule are subsequently amplified and represented in the final sequencing library less than 15 fpkm. Of the remaining genes, only one non-targeted human gene, MT-RNR2-L12, showed significant depletion when compared with the un-treated samples (Fig. 2c). MT-RNR2-L12 is a pseudogene and shares over 90 % sequence identity with a portion of the 16S mitochondrial rRNA gene. Out of the 24 sgRNA sites within the homologous region, 16 of them retain intact PAM sites in MT-RNR2-L12. Of these, seven have perfectly matching 20mer sgRNA target sites, and the remaining nine each have between one and four mutations (Additional file 3). Depletion of this gene is, therefore, an expected consequence of our sgRNA choices.

Reduction of unwanted abundant sequences in CSF samples
We next tested the utility of our method when applied to clinically relevant samples. In the case of pathogen detection in patient samples, the microbial transcripts are typically low in number and become greatly outnumbered by human host sequences. As a result, sequencing depth must be drastically increased to confidently detect such small minority sequence populations. We reasoned that depletion of unwanted high-abundance sequences from patient libraries could result in increased representation of pathogen-specific sequence reads. We thus integrated the DASH method with our in-house metagenomic deep sequencing diagnostic pipeline for patients with meningeal inflammation (i.e., meningitis) or brain inflammation (i.e., encephalitis) likely due to an infectious agent or pathogen. Figure 3 and Table 1  In the case of a patient with meningoencephalitis whose CSF was previously shown to be infected with the amoeba Balamuthia mandrillaris [25] (patient 1), diagnosis was originally made by identification of a small fraction (<0.1 %) of reads aligning to specific regions of the B. mandrillaris 16S mitochondrial gene. After DASH treatment, human mitochondrial 12S and 16S genes were reduced by more than an order of magnitude, and . Across all cases, the DASH technique significantly reduced the coverage of human 12S and 16S genes by an average of 7.5-fold while increasing the coverage depth for pathogenic sequences by an average 5.9-fold. See Table 1 for relevant data and Taenia solium (pork tapeworm; patient 3) infections showed 2-and 3.9-fold increases in coverage of the 18S genes of C. neoformans and T. solium, respectively, the detection of which was crucial in the initial diagnoses. The observed increases in relative signal can be translated into either a sequencing cost savings or a higher sensitivity that may be useful clinically for earlier detection of infections.
Reduction of wild-type background for detection of the KRAS G12D (c.35G>A) mutation in human cancer samples Specific driver mutations known to promote cancer evolution and at times to make up the genetic definition of malignant subtypes are important for diagnosis and targeted therapeutics (i.e., precision medicine). In complex samples isolated from biopsies or cell-free body fluids such as plasma, wild-type DNA sequences often overwhelm the signal from mutant DNA, making the application of traditional Sanger sequencing challenging [2,3,26]. For NGS, detection of minority alleles requires additional sequencing depth and therefore increases cost. We reasoned that the DASH technique could be applied to increase mutation detection from a PCR amplicon derived from a patient sample. We chose to focus on depletion of the wild-type allele of KRAS at the glycine 12 position, a hotspot of frequent driver mutations across a variety of malignancies [27][28][29]. This is an ideal site for DASH because all codons encoding the wild-type glycine residue contain a PAM site (NGG), while any mutation that alters that residue (e.g., c.35G>A, p.G12D) ablates the PAM site and is thus uncleavable by Cas9 (Fig. 4a). This will be true of any mutation that changes a glycine (codons GGA, GGC, GGG, and GGT) or a proline (codons CCA, CCC, CCG, and CCT) to any other amino acid. Furthermore, it is relevant to the ubiquitous C>T nucleotide change found in germline mutations as well as somatic cancer mutations [30]. Targeting of other mutations will likely be possible in the near future with reengineered CRISPR nucleases or those that come from alternative species and have different PAM site specificities [31,32]. The sequence of the sgRNA designed to target the KRAS G12D PAM site is listed in Additional file 1, as is the non-human sequence used for the negative control sgRNA. Both were transcribed from a DNA template by T7 RNA polymerase, purified, and complexed with Cas9 as described in the "Materials and methods" section. Samples were prepared by mixing sheared genomic DNA from a healthy individual (with wild-type KRAS genotype confirmed with dPCR) and KRAS G12D genomic DNA to achieve mutant to wild-type allelic ratios of 1:10, 1:100, and 1:1000, and 0:1. For each mixture, 25 ng of a DNA was incubated with 25 nM Cas9 pre-complexed with 25 nM of sgRNA targeting KRAS G12D. This concentration is high relative to the concentration of target molecules, but empirically we found it to be the most efficient ratio. We hypothesize that this may be due to non-cleaving Cas9 interactions with the rest of the human genome [24], which effectively reduce the Cas9 concentration at the cleavage site.
Samples were subsequently heated to 95°C for 15 min in a thermocycler to deactivate Cas9 ("Materials and methods"). Droplet digital PCR (ddPCR) was used to count wild-type and mutant alleles using the primers and TaqMan probes depicted in Fig. 4a and described in the "Materials and methods" section. All samples were processed in triplicate. Samples incubated with or without Cas9 complexed to a non-human sgRNA target show the expected percentages of mutant allele: approximately 10 %, 1 %, and 0.1 % for the 1:10, 1:100, and 1:1000 initial mixtures respectively (Fig. 4b). With addition of Cas9 targeted to KRAS, the wild-type allele count drops nearly two orders of magnitude (purple bars in Fig. 4b), while virtually no change is observed in number of mutant alleles (blue bars). This confirms the high specificity of Cas9 for the NGG of the PAM site.
With the addition of DASH targeted to KRAS G12, the percentage of mutant allele jumps from 10 % to 81 %, from 1 % to 30 %, and from 0.1 % to 6 % (Fig. 4c). This corresponds to 8.1-fold, 30-fold and 60-fold representational increases for the mutant allele, respectively. As expected, there was virtually no detection of mutant alleles in the wild-type-only samples both with and without DASH treatment (one droplet in one of three no DASH wild-type-only samples).

Discussion
In this paper we have introduced DASH, a technique that leverages in vitro Cas9 ribonucleoprotein (RNP) activity to deplete specific unwanted high-abundance sequences, which results in the enrichment of rare and less abundant sequences in NGS libraries or amplicon pools.
While the procedure may be easily generalized, we developed DASH to address current limitations in metagenomic pathogen detection and discovery, where the sequence abundance of an etiologic agent may be present as a minuscule fraction of the total. For example, infectious encephalitis is a syndrome caused by well over 100 pathogens ranging from viruses, fungi, bacteria and parasites. Because of the sheer number of diagnostic possibilities and the typically low pathogen load present in CSF, more than half of encephalitis patients never have an etiologic agent identified [33]. We have demonstrated that NGS is a powerful tool for identifying infections, but as the B. mandrillaris meningoencephalitis case demonstrates, the vast majority of sequence reads are "wasted" re-sequencing high abundance human transcripts. In this case, we have shown that DASH depletes with incredible specificity the small number of human rRNA transcripts that comprise the bulk of the NGS library, thereby lowering the required sequencing depth to detect non-human sequences and enriching the proportion of non-human (Balamuthia) reads in the metagenomic dataset. In this study, we have targeted mitochondrial rRNA species because we have consistently observed them to be the most abundant sequences in these CSF-derived RNA samples. For other types of tissues, alternative programming of DASH for removal of nuclear rRNA species or essentially any other abundant sequences would be warranted.
In the case of infectious agents, it is possible to directly enrich rare sequences by hybridization to DNA microarrays [34] or beads [12]. However, these approaches rely on sequence similarity between the target and the probe and therefore may miss highly divergent or unanticipated species. Furthermore, the complexity and cost of these approaches will continue to increase with the known spectrum of possible agents or targets. In contrast, the identity and abundance of unwanted sequences in most human tissues and sample types has been well described in scores of previous transcriptome profiling projects [23], and therefore optimized collections of sgRNAs for DASH depletion are likely to remain stable.
A number of methods for depleting ribosomal RNA from RNA-Seq libraries exist in the form of commercially available kits. We assert that DASH is equally effective or better than these methods on four metrics: (1) input requirements, (2) performance, (3) programmability, and (4) cost. These can be assessed based on information available on company websites or in publications for three major competing techniques: Illumina's Ribo-Zero and Thermo Fisher's RiboMinus, which both use biotinylated capture Fig. 4 a DASH is used to selectively deplete one allele while keeping the other intact. An sgRNA in conjunction with Cas9 targets a wild-type (WT) KRAS sequence. However, since the G12D (c.35G>A) mutation disrupts the PAM site, Cas9 does not efficiently cleave the mutant KRAS sequence. Subsequent amplification of all alleles using flanking primers, as in the case of digital PCR, Sanger sequencing, or high-throughput sequencing, is only effective for non-cleaved and mutant sites. b Three human genomic DNA samples with varying ratios of wild-type to mutant (G12D) KRAS were treated either with KRAS-targeted DASH, a non-human control DASH, or no DASH. Counts of intact wild-type and G12D sequences were then measured by droplet digital PCR (ddPCR). c Same data as in (b), presented as percentage of mutant sequences detected. Inset shows fold enrichment of the percentage of mutant sequences with KRAS-targeted DASH versus no DASH. For both (b) and (c), values and error bars are the average and standard deviation, respectively, of three independent experiments probes for depletion; and New England Biolab's NEBNext rRNA depletion kit, which uses RNAse H for depletion.

Input requirements
Illumina recommends 1 μg of total RNA as input for Ribo-Zero, but also has a low-input protocol requiring only 100 ng [35]. ThermoFisher recommends 2-10 μg of total RNA for its standard RiboMinus protocol [36], and 100 ng to 1 μg for its Low Input RiboMinus Eukaryote System v.2 [37]. NEB recommends 10 ng to 1 μg total RNA input for the NEBNext rRNA Depletion Kit [38]. The reason for these stringent amount requirements is that these three methods all deplete samples at the RNA stage. DASH, in contrast, avoids the need to delicately manipulate the original sample. Instead, DASH is employed after cDNA synthesis and library generation; thus, it can be performed on any library, without regards to starting total RNA amount, or the manner in which the library was constructed (tagmentation or otherwise). For scarce and precious samples, such as patient CSF, often less than 10 ng of total cDNA is available even after NuGEN Ovation amplification; prior to this work, no commercial depletion method was available for these samples.

Performance
All commercial rRNA depletion methods promise at least 85 % reduction in reads of the sequences they target. Illumina states that the Ribo-Zero technique can achieve between 85 % and >99 % reduction in the rRNA sequences it targets [35]; RiboMinus states 95-98 % reduction [39]; and NEBNext states 95-99 % reduction [38]. Adiconis et al. [23] compared several RNA-Seq methods and reported on many metrics, including depletion of rRNA sequences. Ribosomal RNA sequences comprised 84.7 % of reads in their un-depleted sample (100 ng total RNA from K-562 cells), while Ribo-Zero reduced this to 11.3 % (an 86.7 % reduction), and RNAse H reduced it to 0.1 % (a 99.9 % reduction). In this paper, we show that DASH decreases the mitochondrial rRNA reads in HeLa total RNA from 61 % to 0.055 % (99.9 % reduction). Adiconis et al. obtained similar numbers from 1 μg total RNA samples from formalin-fixed paraffin-embedded (FFPE) kidney tissue (78.2 % and 99.9 % reduction for Ribo-Zero and RNAse H, respectively) and pancreas tissue (73.0 % and 99.7 % reduction for Ribo-Zero and RNAse H, respectively). This is comparable to DASH reduction in three patient CSF samples (82.1 %, 81.4 % and 88.2 % reduction). However, it is important to note again that Adiconis et al. used 1 μg total RNA from tissue samples, while the DASHed CSF samples consisted of only 5 ng of NuGEN Ovation-amplified cDNA (total RNA content in the original CSF samples was too low to accurately quantify).
Another important measure of performance is maintenance of relative abundances of non-targeted sequences, such as the human transcriptome. Correlation coefficients for samples with and without DASH treatment ranged from R 2 = 0.979 to 0.994 in this study ( Fig. 2; Additional file 4), slightly higher than those found by Adiconis et al. for all methods [23].
Programmability DASH can be adapted to target any sequence containing a PAM site; construction of new sgRNAs is facile and inexpensive (see "Materials and methods" section). Because it is employed after sequencing adapter addition, DASH's utility is not limited to RNA-Seq; it can be applied to any library type. Examples include ATAC-Seq libraries, in which desired nuclear DNA is contaminated with a significant amount of mitochondrial DNA sequences, and microbiome sequencing, where it may be desirable to eliminate a particularly abundant species in order to better sample the underlying diversity. Since Ribo-Zero, RiboMinus and NEBNext are all proprietary kits, they cannot easily be re-programmed by the user to target other sites.

Cost
Based on current publicly available list prices of the most economical kit sizes, the per-sample costs (in US dollars) of the kits discussed here are $82.00 (Ribo-Zero Gold Kit H/M/R) [35], $93.67 (RiboMinus Human/Mouse Transcriptome Isolation Kit) [36] and $45.00 (NEBNext rRNA Depletion Kit H/M/R) [38]. In contrast, we calculate the cost of DASH at less than $4 per sample when Cas9 and T7 RNA polymerase are made in-housea very sensible solution for labs that are already spending large amounts of money on NGS. Where Cas9 production is not possible, DASH can still be carried out using commercially available Cas9 protein.
DASH may also enhance the detection of rare mutant alleles that are important for liquid biopsy cancer diagnostics. Allelic depletion with DASH increases the signal (oncogenic mutant allele) to noise (wild-type allele) by more than 60-fold when studying the KRAS hotspot mutant p.G12D. Other approaches for enriching lowabundance mutations exist, such as restriction enzyme digestion and COLD-PCR. However, these methods are limited when large mutation panels are required. Here we have described a single application for DASH in cancer, but the utility of this method will be fully realized by multiplexing large panels of mutation sites, using guide RNAs and PAM sites as a way to essentially create programmable restriction enzymes that can be used in a single pool. With the rapidly growing number of oncologic therapies that target particular cancer mutations, sensitive and non-invasive techniques for cancer allele detection are increasingly relevant for optimizing patient care [26]. These same techniques are also becoming increasingly important for diagnosis of earlier stage (and generally more curable) cancers as well as the detection of cancer recurrence without needing to re-biopsy the patient [2,14,[36][37][38].
The potential applications of DASH are manifold. Currently, DASH can be customized to deplete any set of defined PAM-adjacent sequences by designing specific libraries of sgRNAs. Given the popularity and promise of CRISPR technologies, we anticipate the adaptation and/or engineering of CRISPR-associated nucleases with more diverse PAM sites [31,32,40]. A portfolio of next-generation Cas9-like nucleases would further enable DASH to deplete large and diverse numbers of arbitrarily selected alleles across the genome without constraint. We envision that DASH will be immediately useful for the development of non-invasive diagnostic tools, with applications to low input samples or cell-free DNA, RNA, or methylation targets in body fluids [4,6,[41][42][43][44][45].
Many other NGS applications could also benefit from depletion of specific sequences, including hemoglobin mRNA depletion for RNA-Seq of blood samples [46] and tRNA depletion for ribosome profiling studies. Depletion of pseudogenes or otherwise homologous sequences by small but consistent differences in sequences is also theoretically possible, and may serve to remove ambiguities in clinical high-throughput sequencing. Using DASH to enrich for minority variations in microbial samples may enable early discovery of pathogen drug resistance. Similarly, the application of DASH to the analysis of cell-free DNA may augment our ability to detect early markers of drug resistance in tumors [26].

Conclusions
Here, we have demonstrated the broad utility of DASH to enhance molecular signals in diagnostics and its potential to serve as an adaptable tool in basic science research. While the degree of regional depletion of mitochondrial rRNA was sufficient for our application, the depletion parameters were not maximized: we used only 54 sgRNA target sites out of about 250 possible S. pyogenes Cas9 sgRNA candidates in the targeted mitochondrial region. Future studies will explore the upper limit of this system while elucidating the most effective sgRNA and CRISPR-associated nuclease selections, which will likely differ based on target and application. Irrespective, depletion of unwanted sequences by DASH is highly generalizable and may effectively lower costs and increase meaningful output across a broad range of sequence-based approaches.

Generation of cDNA from HeLa cell line and clinical samples
CSF samples were collected under the approval of the institutional review boards of the University of California San Francisco and San Francisco General Hospital. Samples were processed for high-throughput sequencing as previously described [1,25]. Briefly, amplified cDNAs were made from randomly primed total RNA extracted from 250 μL of CSF or 250 pg of HeLa RNA using the NuGEN Ovation v.2 kit (NuGEN, San Carlos, CA, USA) for low nucleic acid content samples. A Nextera protocol (Illumina, San Diego, CA, USA) was used to add on a partial sequencing adapter on both sides.

In vitro preparation of the CRISPR/Cas9 complex
The Cas9 expression vector, containing an N-terminal MBP tag and C-terminal mCherry, was kindly provided by Dr. Jennifer Doudna. The protein was expressed in BL21 Rosetta cells for three hours at 18°C. Cells were pelleted and frozen. Upon thawing, cells from a 4 L culture preparation were resuspended in 50 mL of lysis buffer (50 mM sodium phosphate pH 6.5, 350 mM NaCl, 1 mM TCEP (tris(2-carboxyethyl)phosphine), 10 % glycerol) supplemented with 0.5 mM EDTA, 1 μM PMSF (phenylmethanesulfonyl), and a single Roche complete EDTA-free protease inhibitor tablet (Roche Diagnostics, Indianapolis, IN, USA) and passed through an HC-8000 homogenizer (Microfluidics, Westwood, MA, USA) five times. The lysate was clarified by centrifugation at 20,000 rpm for 45 min at 4°C and then filtered through a 0.22 μm vacuum filtration unit. The filtered lysate was loaded onto three 5 mL HiTrap Heparin HP columns (GE Healthcare, Little Chalfont, UK) arranged in series on a GE AKTA Pure system. The columns were washed extensively with lysis buffer, and the protein was eluted with a gradient of lysis buffer to buffer B (lysis buffer supplemented with NaCl up to 1.5 M). The resulting fractions were analyzed by Coomassie gel, and those containing Cas9 (centered around the point on the gradient corresponding to 750 mM NaCl) were combined and concentrated down to a volume of 1 mL using 50 K MWCO Amicon Ultra-15 Centrifugal Filter Units (EMD Millipore, Billerica, MA, USA) and then fed through a 0.22 μm syringe filter. Using the AKTA Pure, the 1 mL of filtered protein solution was then injected onto a HiLoad 16/600 Superdex 200 size exclusion column (GE Healthcare, Little Chalfont, UK) pre-equilibrated with buffer C (lysis buffer supplemented with NaCl up to 750 mM). Resulting fractions were again analyzed by Coomassie gel, and those containing purified Cas9 were combined, concentrated, supplemented with glycerol up to a final concentration of 50 %, and frozen at −80°C until use. Protein concentration was determined by BCA assay. Yield was approximately 80 mg from 4 L of bacterial culture.
sgRNA target sites were selected as described in the main text. DNA templates for sgRNAs based on an optimized scaffold [47] were made with a similar method to that described in [48]. For each chosen target, a 60mer oligo was purchased including the 18-base T7 transcription start site, the targeted 20mer, and the first 22 bases of the tracr RNA (5′-TAATACGACTCACTATAGNN NNNNNNNNNNNNNNNNNNGTTTAAGAGCTATG CTGGAAAC-3′). This was mixed with a 90mer representing the 3′ end of the sgRNA on the opposite strand (5′-AAAAAAAGCACCGACTCGGTGCCACTTTTTC AAGTTGATAACGGACTAGCCTTATTTAAACTTGC TATGCTGTTTCCAGCATAGCTCTTA-3′). DNA templates for T7 sgRNA transcription were then assembled and amplified with a single PCR reaction using primers 5′-TAATACGACTCACTATAG-3′ and 5′-AAAAAA AGCACCGACTCGGTGC-3′. The resulting 131 base pair (bp) transcription templates, with the sequence 5′-TAATACGACTCACTATAGNNNNNNNNNNNNN NNNNNNNGTTTAAGAGCTATGCTGGAAACAGCA TAGCAAGTTTAAATAAGGCTAGTCCGTTATCAAC TTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT-3′, were pooled (for the mitochondrial rRNA library), or transcribed separately (for the KRAS experiments). All oligos were purchased from IDT (Integrated DNA Technologies, Coralville, IA, USA).

CRISPR/Cas9 treatment
To form the ribonucleoprotein (RNP) complex, Cas9 and the sgRNAs were mixed at the desired ratio with Cas9 buffer (final concentrations of 50 mM Tris pH 8.0, 100 mM NaCl, 10 mM MgCl 2 , and 1 mM TCEP), and incubated at 37°C for 10 min. This complex was then mixed with the desired amount of sample cDNA in a total of 20 μL, again in the presence of Cas9 buffer, and incubated for 2 h at 37°C.
Since Cas9 has high nonspecific affinity for DNA [24] it was necessary to disable and remove the Cas9 before continuing. For the rRNA depletion samples, 1 μL (at >600 mAU/mL) of Proteinase K (Qiagen, Hilden, Germany) was added to each sample which was then incubated for an additional 15 min at 37°C. Samples were then expanded to a volume of 100 μL and purified with three phenol:chloroform:isoamyl alcohol extractions followed by one chloroform extraction in 2 mL Phaselock Heavy tubes (5prime, Hilden, Germany). We added 10 μL of 3 M sodium acetate pH 5.5, 3 μL of linear acrylamide and 226 μL of 100 % ethanol to the 100 μL aqueous phase of each sample. Samples were cooled on ice for 30 min. DNA was then pelleted at 4°C for 45 min, washed once with 70 % ethanol, dried at room temperature and resuspended in 10 μL water.
In the case of the KRAS samples, Cas9 was disabled by heating the sample at 95°C for 15 min in a thermocycler and then removed by purifying the sample with a Zymo DNA Clean & Concentrator-5 kit (Zymo Research, Irvine, CA, USA).

High-throughput sequencing and analysis of sequencing data
Tagmented samples with and without DASH treatment underwent 10-12 cycles of additional amplification (Kapa Amplification Kit, Kapa Biosystems, Wilmington, MA, USA) with dual-indexing primers. A BluePippin instrument (Sage Science, Beverly, MA, USA) was used to extract DNA between 360 and 540 bp. Sequencing libraries were purified using the Zymo DNA Clean & Concentrator-5 kit and amplified again on an Opticon qPCR machine (MJ Research, Waltham, MA, USA) using a Kapa Library Amplification Kit until the exponential portion of the quantitative PCR signal was found. Sequencing libraries were then pooled and re-quantified with a ddPCR Library Quantification Kit (Bio-Rad, Hercules, CA, USA). Sequencing was performed on portions of one lane in an Illumina HiSeq 4000 instrument using 135 bp paired-end sequencing.
All reads were quality filtered using PriceSeqFilter v.1.2 [51] such that only read pairs with less than five ambiguous base calls (defined as Ns or positions with <95 % confidence based on Phred score) were retained. Filtered reads were aligned to the hg38 build of the human genome using the STAR aligner (v.2.4.2a) [52]. The number of mapped reads per gene and fpkm values were calculated using the exon length and sequence information encoded in the Gencode v.23 primary annotations (GTF file). Library complexity was determined by calculating the reduction in library size after clustering using the cd-hit-dup package [53,54]. Pathogen-specific alignments to 16S and 18S sequences were accomplished using Bowtie2 [55]. Per-nucleotide coverage was calculated from alignment (SAM/BAM) files using the SAMtools suite [56] and analyzed with custom iPython [57] scripts utilizing the Pandas data package. Plots were generated with Matplotlib [58].

dPCR of KRAS mutant DNA
KRAS wild-type DNA was obtained from a healthy consenting volunteer. The sample sat until cell separation occurred, and DNA was extracted from the buffy coat with the QIAamp Blood Mini Kit (Qiagen, Hilden, Germany). KRAS G12D genomic DNA from the human leukemia cell line CCRF-CEM was purchased from ATCC (Manassas, VA, USA). All DNA was sheared to an average of 800 bp using a Covaris M220 (Covaris, Woburn, USA) following the manufacturer's recommended settings. Cas9 reactions occurred as described above.

Ethics
CSF samples, as well as a whole blood sample for the KRAS negative control, were collected under the approval of the institutional review boards of the University of California San Francisco and San Francisco General Hospital (IRB number 13-12236). All experimental methods comply with the Helsinki Declaration.

Availability of data and materials
All sequencing data for human subjects has been deposited to NCBI's database of Genotypes and Phenotypes (dbGaP) and can be accessed at http://www.ncbi.nlm.nih.gov/gap by entering study accession number phs001067.v1.p1. Sequencing data for HeLa samples has been deposited as a separate BioProject in NCBI's Sequence Read Archive (SRA) and can be found at http://www.ncbi.nlm.nih.gov/bioproject by entering study accession number PRJNA311047. Reagents are available upon request from J.L.D.