STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions

Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community, at once validating submissions while also augmenting sample metadata with reliable, searchable, taxonomic terms. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02490-0.


Background
Established in 2007, the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) accepts raw sequencing data directly from highthroughput sequencing platforms [1].Next-generation sequencing (NGS) sets are inherently large, and improved technologies are exquisitely sensitive to contamination.Submissions must be processed, before either interpretation or quality assessment is possible, to provide submitter feedback and submission verification.The growth of data submission is exponential (doubling approximately every 12 months [2]), rendering use of computationally expensive methods, such as de novo assembly followed by alignment, impractical due to costs and limits of scale, particularly given the time constraint of submission processing.
We considered that questions about the quality of a given NGS run could reasonably be inferred from the taxonomic distribution of reads within that set, whether based on a single organism or of metagenomic design.This is often enough information to answer basic experimental or clinical questions, as well as inform decisions about the merit of subsequent resource-intensive assessment methods.Read sets with organismal tags can be used to select data for further analysis.Moreover, binning reads into taxonomic buckets can identify contaminating reads and reads outside of the stated experimental scope.Such identified reads can be filtered from a sample before downstream processing.This proposed taxonomic analysis is independent of metadata and intrinsic to the run, capable of both validating submissions and augmenting sample metadata with reliable, searchable, taxonomic terms.
Following these principles, we developed a k-mer-based Sequence Taxonomic Analysis Tool (STAT).Based on MinHash [3], and inspired by Mash [4], STAT employs a reference k-mer database built from available sequenced organisms to allow mapping of query reads to the NCBI taxonomic hierarchy [5].We use the MinHash principle to compress the representative taxonomic sequences by orders of magnitude into a k-mer database, followed by a process that yields a set of diagnostic k-mers for each organism.This allows for significant coverage of taxa with a minimal set of diagnostic k-mers.Our results show STAT is a reliable method for examining submitted NGS data in a timely, and scalable, manner.

Results
STAT was developed for quality assessment of SRA submissions to be shared with the submitter, requiring that analyses ideally take no more time than that of existing submission processing, while minimizing resource usage.Our design starts from the Min-Hash principle that a random selection of the lowest valued constituent blocks in a pool after hashing represents a signature of the parent object.In building k-mer databases from the set of sequences assigned a specific NCBI taxonomy id (TaxId), we read 32 base pair (bp) k-mers as 64-bit FNV-1 hashes [6], selecting the minimum hash value to identify the k-mer representative for a window, then iteratively merging k-mers from taxonomic leaves to roots (see "Methods," Figs. 1 and 2).
Initial analysis using only densely populated k-mer databases performed well.However, despite being on average over an order of magnitude smaller than the input sequence database size (see below), we determined that loading the entire densely merged "tree_filter.dbs"into memory for analysis unnecessarily incurred long I/O read time and large memory costs since most runs required only a fraction of the complete database.Moreover, STAT jobs, like many computational pipelines, are submitted to either a local computer farm cluster scheduler ("grid engine"), or by dispatching cloudbased virtual machines.In both cases, job scheduling typically requires explicit needed resource declarations such as CPU and memory.An initial screen capable of evaluating diversity of the sample and necessary resource requirements for detailed analysis minimizes cost and maximizes computational efficiency.For these reasons, we pursued a selective two-step analysis, using a sparse filtering database in the first step to identify the presence of any (a) eukaryote if there are more than 100 biological reads of a species, (b) bacteria, or archaea with more than 10 biological reads, and (c) virus if there are 1-2 biological reads.This first pass is neither qualitative, nor exhaustive, but allows us to quickly identify taxa for focus in the second pass (Fig. 3).
To facilitate this two-step process, and further minimize resource requirements, we decreased k-mer database size by 33% by storing only the 8-byte k-mers in a database file, separately storing pairs of TaxId, total TaxId k-mer count for each TaxId respectively in an auxiliary "annotation" file.The k-mer database / k-mer count annotation file pair is designated "dbss," the database sorted by TaxId, with each TaxId set sorted by k-mer.TaxIds identified in the first step against the sparse k-mer database are used in the second step to load into memory only those TaxId k-mers using the counts provided by the annotation file as offsets.MinHash sampling combined with dynamic loading of only necessary dense TaxId database k-mers yields significant benefits for cpu and memory requirements.Further, the selection of TaxIds to load may be augmented by heuristics, such as purposely withholding TaxIds from contamination detected in the prior filtering step.
STAT reports the distribution of biological reads mapping to specific taxonomic nodes as a percentage of total biological reads mapped within the analyzed run.Since results are proportional to the size of sequenced genomes, a mixed sample containing several organisms at equal copy number is expected to find more reads originating from the larger genomes.This means that percentages reported likely reflect sample genome size(s) and must be considered by the user against the genomic complexity of the sequenced sample.
Like all sequence-based classification schemes employing "least common ancestor" analysis, STAT reflects and depends upon accurate sequence taxonomic attribution, Fig. 2 K-mer taxonomic merging.See "Methods" for details. a Before merging two sibling species are depicted containing both unique and shared k-mers (indicated in bold).b After merging the two shared kmers "merge" up to the genus level, while those unique to each remain diagnostic for the species taxonomic relationships, and taxonomic depth and breadth.The significant achievements of adapting both the NCBI reference sequence (RefSeq) data model [7], and internationally accepted taxonomy to incorporate metagenomic viral sequences [8,9] fundamentally benefit STAT and other similar classification tools. is sequentially rendered into 32 bp k-mers, and matches to the decimal values found in the sparse database identifying taxa for deeper analysis.b TaxIds identified in a are used to select the densely sampled k-mers derived from those taxa, then the same query is used in a second quantitative pass.c Bordered in red is the immediate STAT output consisting of one line for each spot with hits, each followed by one or more TaxIds matching that spot.Examples of more than one hit for a TaxId are shown in bold.d The first post processing output bordered in purple depicts the result of resolving each spot in c to a single taxon.e The final processing step resolves the run composition from the spots resolved in d, and an example from our public display using that result is shown An important consequence of merging in k-mer database construction is to avert complications caused by biological complexities.For example, most k-mers derived from endogenous retroviruses found in the human input reference genome will likely merge to the root as those k-mers would also be found in the Viruses Super Kingdom.
Further, when analyzing results, each level-read, run-requires integration of less than ideal signals.It is common to find multiple TaxIds identified in a single biological read, ideally coherent for a given lineage.Were those Mus musculus, Murinae, and Mammal, there is confidence in declaring the read Mus musculus.Should a read map to multiple, related taxonomic nodes, it is reported as originating from the most proximal shared taxonomic node.For example, a read with hits to sibling species may be reported as their common genus, conservatively locating the most proximal common node before ambiguity (Fig. 4).Likewise, such conservative heuristics are required when integrating the signals from all biological reads to report the run.If the run subject is a single organism, it is expected that STAT would identify taxonomic nodes across the lineage, and that the number of reads mapping to higher level nodes will be more than those mapping to terminal nodes.
STAT was designed as a tool for assessing the quality of any SRA submission and has matured into a tool that also significantly enhances user comprehension.Many kmer tools were created for the purpose of metagenomic taxonomic assignment [10] during STAT development, including some based on MinHash [11,12].Unlike these other MinHash-based metagenomic tools, STAT reports taxonomic hits per spot 1 .Taxonomic classifiers balance speed, accuracy, and memory requirements.While STAT was neither primarily developed for metagenomic analyses, nor as a tool for distribution, the same concerns apply.Using MinHash to sample and save at most 1 out of every 64 k-mers generated from input sequences yields k-mer databases 1-2 orders of magnitude smaller than the parental reference nucleotide database from which they were derived.For example, currently the BLAST® refseq_genomes database used is 1.4 terabytes (tb) whereas the representative sparse and dense STAT k-mer databases are approximately 1.5 gigabytes (gb), and 75 gb, respectively.
We compare STAT accuracy to Kraken 2 using the strain exclusion test as described by Wood et al. [14].While limited to NGS classification tools that report taxonomic assignment on a "per-fragment" basis, using this test allows direct comparison to previous published results, such as Fig. 2 in Wood et al. [14].STAT shows the identical accuracy of Kraken 2 for both bacteria and virus (see Fig. 5).As expected, STAT sensitivity is notably dampened as we chose to sample the widest taxonomic breadth.Our desire for conservative taxonomic assignment is further reflected by STAT never yielding a false positive bacterial identification in accuracy test results (Additional file 3, S1).While NCBI reference bacterial genomes are used for STAT database input, the significant lack of representative RefSeq viral genomes led us to input non-reference viral records.False positives are seen in the virus accuracy test results, though approximately half of these likely represent true biological identification within the host organism genome, while the remainder may indicate database contamination (Additional file 3, S2, S3).
We found it unnecessary to apply the same selection of k-mer hash minimums from query sequences to compose a similarity index [3,4,11,12], instead of exact k-mer matching.We show that accuracy is robust, while still reflecting our conservative bias in taxonomic assignment.Though similar in performance to Kraken 1 input speed (21.6 million reads/minute) and runtime (132.5 s) characteristics, STAT (maximum resident set size 830,304 kilobytes) required only 8% and 1% of the memory needed by Kraken 1 and Kraken 2, respectively [14]. the accuracy test (see "Methods") required additional time for extracting the requested TaxId k-mers on demand.Maximum resident set size during the accuracy test was approximately an order of magnitude greater than Kraken 2 (14, data not shown), despite loading a k-mer database 20 times the "strain_excluded" FASTA file size (3.9 gb) and over 100 times "strain_excluded.dbs"size (545 megabytes (mb)).
We provide two symmetrical examples of expected and unexpected contamination that illustrate STAT effectiveness.

Contamination during a pandemic
Like many public health institutions worldwide Public Health England (PHE) programmatically surveils infectious pathogens using NGS, and submits targeted reference genomic analyses to SRA.The SARS-CoV-2 pandemic emerged in December 2019, and many countries outside China identified their first cases in early 2020 [15].The UK's first cases were identified on January 30, 2020 [16].We began developing scientific pandemic resources using STAT results to identify SARS-CoV-2 submissions (see "Methods") and were surprised when the metadata of many of these records listed a single bacterial source organism.This routine STAT analysis of submissions during early 2020 identified over 2000 PHE surveillance bacterial NGS submissions likely contaminated with SARS-CoV-2 sequences.The earliest of these was dated February 11, 2020, less than 2 weeks from the first recognized UK cases.PHE was alerted to the likely carryover contamination, acting quickly to limit further events.Subsequent investigation confirmed SARS-CoV-2 contamination, ranging from a minimum of 1 positive spot containing 1 positive hit, reaching to 4233 positive spots containing 18,270 hits (see "Methods," and Additional file 1).This example underscores STAT utility in monitoring submissions for possible contamination, allowing curators to contact mitters to alert, and correct, the source of contamination.

Identifying and removing potential personally identifying information
As lower cost significantly expanded human genome sequencing, awareness rose of potential personally identifying information residing in public repositories [17].Large efforts employing NGS to diagnose and monitor human health or detect pathogenic outbreaks such as SARS-CoV-2, caused clinical sample submitters to worry about the inclusion of human sequence.As a counterpart to the previously discussed contamination example, we sought a STAT-based tool to find and remove unavoidable human sequence reads in clinical pathogen samples.
We began by building a k-mer database using human reference sequences withholding the iterative merging previously described.The majority (approximately 80%, see Methods) of k-mers derived represent conserved ancestral sequences, but our goal here is to aggressively identify human sequences.We then subtracted any k-mer also found in the merged kingdom databases Viruses and Bacteria to protect against spurious false positive hits targeting clinical pathogens.After testing several window sizes, we found optimal performance using a segment of 32 bp (twice as dense as our standard taxonomy database).
Because unintended contamination is never uniform, we chose different ends of the expected spectrum of human content for testing (see Table 1).Two RNA_Seq runs were derived from bronchoalveolar lavage fluid taken from suspected SARS-CoV-2 patients.The wash of the lower respiratory tract from a patient suffering an active infection is expected to contain patient immune cells, sloughed patient epithelial cells, lung microbiota, and suspect clinical pathogens.Each run contains over five million spots, with approximately 85% eukaryotic content (see Additional file 2, S5).Table 1 shows that for SRR11092056 the STAT Human Sequence Removal Tool removed 92% (45234589 / 5239723) of the spots, and 90% (4683473 / 5184909) of SRR11092057 spots.The observation that a 3% selection of all possible human-derived 32-bp k-mers Summary of results for SRA accessions subjected to STAT Human Sequence Removal Tool (see Human Contamination Identification and Removal in "Methods")."Total Spots Remaining" is the count of spots found in the output (fastq) file and subtracting this count from the total determine "Total Spots Removed" We define "Human Spots" as those where all hits (up to top five) are identified as human with eValue < 1e−10."Conserved Lineage Spots" are those containing a human top hit (lowest eValue) though not the exclusive organism of hits with eValue < 1e−10, and where all spot hits have either identical eValue or the greatest has eValue < 1e−14."Total Length of Human Spot Alignments" is the sum of all the top alignments for all human spots remaining identifies over 90-92% of a random selection likely human spots using MinHash and underscores its efficiency.These examples present a difficult test, and we identify 5-6% of the remaining spots as human (Table 1).Unlike the previous examples, amplicon-directed sequencing of pathogens is expected to contain less unintended human content, as can be seen in Table 1.In both cases, 0.1% or less spots were removed, while among those remaining, 0.01% or fewer spots were identified as human.In no case was there any deleterious loss of the intended target signal (see Additional file 2, S5 Taxonomic Summary).
It is estimated that as little as 30-80 statistically independent single-nucleotide polymorphisms (SNP) can uniquely identify an individual human [18].The average sequence error rate [19] is greater than estimated human (intra-species) variation [20].Considering the poor coverage of unintended human content in the samples, even in the extreme lavage fluid examples, the total length of spot alignments identified as human are extremely unlikely to reveal validated, statistically independent SNPs capable of individual identification.The great majority of spots characterized by a human best hit though not the exclusive organism of the top five ("Conserved lineage spots" in Table 1) are highly significant alignments to related primates with approximately 20% sharing the same low eValue for all members (see Additional file 2, S1-S4).These likely represent conserved regions unfavored for SNP location [21].

Conclusions
STAT has provided a successful framework for our SRA NGS submission pipeline.Sometimes actual sample content may be unknown, and submitted metadata are often incomplete and of poor quality [22,23].Contamination, as highlighted above, may complicate or confuse further analysis.Recognizing these limitations stimulated our foremost goal to derive signals able to validate and accurately describe submitted data for the benefit of our users.Reflecting the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative [24] and ensuring that NIH-funded research data is findable, accessible, interoperable, and reusable (FAIR) [25], results from STAT are available through Amazon Web Services' Athena and Google Cloud Platform's BigQuery query services.Both can be searched to identify runs containing specific organismal content [26] despite insufficient, incomplete, or incorrect metadata, allowing efficient selection of data for further analysis by the scientific community.Over approximately 5 years, we have processed more than 27.9 Peta base pairs from runs averaging 1.1 Giga base pairs in size with average total processing throughput of 3 min per run.While roughly 20% of runs analyzed to date are withheld by submitter request until ready for publication, nearly 10.8 million are publicly queryable records, now richly annotated by STAT analysis.
Building a STAT database is flexible; it can be tailored to specific needs.For example, we are currently testing a STAT k-mer database designed to identify antimicrobial resistance (AMR) in NGS.The AMR_CDS FASTA file containing sequences curated by the NCBI Pathogen group [27] is used as input to generate 32 bp k-mers with a window size = 1; that is, the complete non-redundant k-mer set.For the purpose of removing human reads from clinical pathogen screening samples, we presented a tool combining STAT aligns_to with a human-specific database.As part of recent NIH-wide efforts to combat SARS-CoV-2, we released a detection tool containing and a Virus that allows users to map k-mers found in NGS data to taxa included under Coronaviridae [28].Our choice to maximize taxonomic coverage while minimizing k-mer count has proved a reasonable and effective balance.Employing the principle of Min-Hash in design, we contribute a framework others may find useful and offer the collection of tools to use freely.
The success we and others have experienced is consistent with the notion of a random model of k-mer occurrence [29].Yet, as keenly shown by Breitwieser et al. [30], unique k-mer hits are the most informative.Through serendipity while preparing this manuscript, our colleague John Spouge enlightened us with his method of a nonparametric statistical approach to assess an NGS run using unique hits for confident measurement of taxonomic assignments2 .We are just beginning to explore this implementation in STAT and look forward to reporting results in the future.

General design
STAT refers to a collection of tools for building k-mer databases, querying those databases, and reporting results of our SRA submission pipeline using the former.Details described below are based on our standard pipeline settings.
k-mer size STAT uses 32 bp k-mers (i.e., k = 32) for database generation, and as the unit for comparison.The majority of unaligned SRA data are reads between 60 and 150 bp in length, with mean error rate of 0.18% [19]: such reads can be expected to yield many correct 32 bp k-mers for reliable identification.While reducing from 32 bp k-mers to 16 bp k-mers decreases the size of resulting databases, there is significant loss of specificity (10 9 ) per k-mer that requires notably increased processing to resolve taxonomic assignment.By comparison, using 64 bp k-mers is extraordinarily more selective, but database size becomes impractical.Finally, with each base encoded in 2 bits, 32 bp kmers fit fully and compactly in a 64-bit integer, while anything between 17 bp and 32 bp requires the same 64-bit integer storage resulting in poor memory efficiency and performance.

k-mer databases
Two types of k-mer databases are constructed (as described below).All unique k-mers are generated and the minimum hash valued k-mer representing the segment size is selected.A dense database selects one k-mer per 64 bp segment ("tree_filter"), of input sequence, while a sparse database ("tree_index") selects one k-mer per 64 bp (Virus), 8000 bp (Eukaryota), and 2000 bp (Bacteria, and Archaea) segment respectfully, noting that segment size is roughly proportional to genome size.k-mer generation k-mers are selected using an derived from MinHash To compose STAT databases, for every fixed length segment ("window") of incoming nucleotide sequence, a list of overlapping k-mers (effectively segment length plus right k-1 bp "wings") is generated.The 32 bp k-mers are encoded using 2 bits per base into 64 bits (8 bytes), the smaller value k-mer strand is chosen and used to generate an FNV-1 hash value [6].The k-mer with the minimal 64-bit hash value is selected to represent this segment (see Fig. 1).
From each (Super Kingdom) root, lineage paths traverse nodes where terminal nodes are those containing only child leaves.Input sequences (see below) have an assigned NCBI Taxonomy Id (TaxId) and represent leaves on these trees.These lineage relationships are represented in a two-column file referred to as "parents," wherein each node TaxId (first column) reports its parent node TaxId (second column).
All sequences (see "Database input sequences") attached to a particular TaxId are input to k-mer database generation using segment ("window") sizes as described.For each input set of sequences assigned a TaxId, the immediate output is a dictionary that contains the set of unique 32 bp k-mers derived as described (we designate this "db" file extension).Each dictionary is further transformed into a binary file that encodes every 32 bp k-mer as an 8-byte (64-bit) integer using 2 bits per base, followed by its TaxId represented in a 4-byte (32-bit) integer.Thus, each k-mer record is stored as one 12byte pair (k-mer, TaxId) in a database file designated with "dbs" file extension, sorted by k-mer for binary search optimization.
Next, using the taxonomic node relationships (found in the "parents" file), starting from the leaves we recursively merge each binary ("dbs") file representing a unique set of k-mers derived from a single TaxId to sibling(s), then parent nodes.Each sibling leaf is merged such that k-mers specific to a leaf remain as diagnostic of that TaxId, while those found in neighboring (sibling species) leaves are moved up ("merged") to the common parent node TaxId (see Fig. 2).This process results in a single merged database file ("tree_filter.dbs")representing all k-mers assigned a TaxId.
While it is difficult to generalize, we note that when the process of merging is complete, approximately 20% of the Homo sapiens 32 bp k-mers remain as unique to human; that is, 80% were not diagnostic for the species and instead merged up the eukaryotic tree.
Database generation can be accomplished using any of the build_index* tools (see github), and each takes parameters for window size and k-mer size.The process of merging is accomplished using merge_db.

Database input sequences
We use NCBI BLAST® "refseq_genomes" database [32], supplemented with viral sequences extracted from the BLAST® "nt/nr" database as input source for taxonomy identification in both sparse ("index") and dense k-mer databases Viral records extracted from "nt/nr" by loading only sequences assigned a TaxId whose lineage root is the Super Kingdom "Viruses".

Querying the taxonomic k-mer database (STAT)
To query a k-mer database an input SRA accession or FASTA sequence is used to generate the unique set of query 32 bp k-mers read as 64-bit integers for finding identical value k-mers (and assigned TaxId) from the designated k-mer database using the tool aligns_to.Proximal results are counts for each specific taxonomic k-mer hit (see "Results," Fig. 3).Passed an SRA accession, STAT built with NCBI NGS library support will retrieve query sequences and aligns_to option -unaligned_only is available to limit analysis to the unaligned reads found in the SRA object.

Database filtering
We determined the need to delete low-complexity k-mers composed of > 50% homopolymer or dinucleotide repeats (e.g., AAAAAA or ACACACACACA).This is accomplished using filter_db.We have also investigated "dusting" input sequences [34] and found it complementary to filtering, though it is not used at this time in our pipeline.
Those with metadata identifying a single bacterial source suggesting contamination with SARS-CoV-2 were subject to two further verification methods.All identified design, and implementation.RL tested early versions of tools.The read and approved the final manuscript.

Fig. 1
Fig. 1 Finding a minimum representative 32-base pair k-mer.a From a selected 64 base pair segment, the series of 64 possible 32-base pair k-mers is defined by sequentially shifting the 32-base window by one base.The first four and last of the possible k-mers are shown schematically.b An example k-mer sequence.c Two-bit encoding of the k-mer sequence shown in b. d The 64-bit decimal value of the k-mer sequence shown in b.The k-mer strand with the lower 64-bit decimal value is used to generate a hash, and the minimal valued hash identifies the representative k-mer for this 64 base segment

Fig. 3
Fig.3STAT two phase query.a In the first qualitative phase the input query (an SRA accession, or fasta file) is sequentially rendered into 32 bp k-mers, and matches to the decimal values found in the sparse database identifying taxa for deeper analysis.b TaxIds identified in a are used to select the densely sampled k-mers derived from those taxa, then the same query is used in a second quantitative pass.c Bordered in red is the immediate STAT output consisting of one line for each spot with hits, each followed by one or more TaxIds matching that spot.Examples of more than one hit for a TaxId are shown in bold.d The first post processing output bordered in purple depicts the result of resolving each spot in c to a single taxon.e The final processing step resolves the run composition from the spots resolved in d, and an example from our public display using that result is shown

Fig. 4
Fig. 4 Resolution of taxonomic assignment.Three distinct k-mer hits to multiple branch taxa within a single spot are indicated with a red star.The arrow indicates conservatively the most proximal unambiguous common node

Fig. 5
Fig. 5 STAT accuracy and sensitivity.Comparison of STAT and Kraken 2 accuracy, sensitivity, and F1 measure using positive predictive value at the genus level for a Bacteria and b Virus

Table 1
Summary of STAT Human Sequence Removal Tool Results