Within-species contamination of bacterial whole-genome sequence data has a greater influence on clustering analyses than between-species contamination

Although it is assumed that contamination in bacterial whole-genome sequencing causes errors, the influences of contamination on clustering analyses, such as single-nucleotide polymorphism discovery, phylogenetics, and multi-locus sequence typing, have not been quantified. By developing and analyzing 720 Listeria monocytogenes, Salmonella enterica, and Escherichia coli short-read datasets, we demonstrate that within-species contamination causes errors that confound clustering analyses, while between-species contamination generally does not. Contaminant reads mapping to references or becoming incorporated into chimeric sequences during assembly are the sources of those errors. Contamination sufficient to influence clustering analyses is present in public sequence databases.

contamination datasets can be analyzed. With these tools, we found that within-species contamination caused substantial errors in single-nucleotide polymorphism (SNP) and multi-locus sequence typing (MLST) pipelines, while between-species contamination resulted in fewer errors. Read mapping and assembly behavior explains this observation: reads from the same species are mapped to references or incorporated into the same contiguous sequences (contigs) as subject reads, while reads from different species usually are not.
We measured SNP and allele distances between subjects and closely related isolates ("nearest neighbors") with the CFSAN SNP Pipeline and core-genome MLST (cgMLST) workflows [14][15][16] (Additional file 1: Table S1). We also performed phylogenetic analyses to provide bootstrap supports for the monophyly of subjects and their nearest neighbors. Importantly, only the subject data are simulated; all other data are real (Additional file 1: Figure S1). This approach keeps the dataset as realistic as possible, so the results should apply to real-world situations.
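As an illustration of what a SNP distance between a subject and its nearest neighbor means, the following minimal sketch counts differing positions between aligned sequences. It is not the CFSAN SNP Pipeline's implementation; the function names, the choice to skip gaps and Ns, and the toy alignment are assumptions made for illustration only.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Assumption: positions with a gap ('-') or ambiguous base ('N')
    in either sequence are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignored = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignored and b not in ignored
    )


def pairwise_distances(alignment):
    """Return SNP distances for every pair of isolates, keyed by sorted name pair."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment; real SNP matrices contain many more sites.
alignment = {
    "subject": "ACGTACGT",
    "nearest_neighbor": "ACGTACGA",
    "outgroup": "ATGTTCGA",
}
distances = pairwise_distances(alignment)
```

In this toy example, contaminant reads that change a base call in the subject would directly inflate its distance to the nearest neighbor.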
To gain insight into these results, we examined the percent of reads mapped to references. Median values were highest for 0.05 and 0.5% within-species contamination (median 96-100%) and lowest for between-species (median 50-91%), while 5% within-species contamination [...] Tables S2 and S3). For between-species contamination, there is an inverse relationship between contamination levels and the percent of reads mapped to references. For example, at 10% contamination, approximately 90% of reads mapped. It appears that the more distant mapped contaminant reads are, the higher the SNP counts. Contaminant reads that are similar enough to the reference to be mapped but distant enough from the subject to introduce variation will generate errors. In turn, these errors may reduce bootstrap support.
A similar relationship exists between allele distances and assembly lengths. Median assembly lengths for 0.05 and 0.5% within-species data are similar to controls (median 3.0-5.6 and 3.0-5.3 megabases [Mb], respectively), while between-species contaminants yielded larger assemblies (median 4.1-9.9 Mb) and the 5% within-species contamination dataset yielded intermediate assemblies (median 3.1-9.1 Mb; Fig. 2g-i).
Fig. 1 Results of SNP and phylogenetic analyses for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10-50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a-c show SNP distances, d-f bootstrap supports, and g-i percent reads mapped.
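The inverse relationship between between-species contamination level and percent of reads mapped can be illustrated with a simple mixture calculation. This is a sketch under stated assumptions, not a model fitted to the data: the function name and both mapping-rate parameters are hypothetical, and the defaults assume that all subject reads map and no contaminant reads map.

```python
def expected_percent_mapped(
    contamination_percent,
    contaminant_mapping_rate=0.0,
    subject_mapping_rate=1.0,
):
    """Expected percent of reads mapped for a subject/contaminant read mixture.

    Assumption: reads map independently at a fixed rate per source
    (rates between 0 and 1); real mapping rates vary by genome and mapper.
    """
    if not 0 <= contamination_percent <= 100:
        raise ValueError("contamination_percent must be between 0 and 100")
    subject_percent = 100 - contamination_percent
    return (
        subject_percent * subject_mapping_rate
        + contamination_percent * contaminant_mapping_rate
    )


# Between-species case: contaminant reads assumed not to map at all.
between_species_10 = expected_percent_mapped(10)
between_species_50 = expected_percent_mapped(50)

# Within-species case: contaminant reads assumed to map as well as subject reads.
within_species_10 = expected_percent_mapped(10, contaminant_mapping_rate=1.0)
```

Under these assumptions, 10% between-species contamination leaves about 90% of reads mapped, in line with the example in the text, whereas closely related contaminant reads still map and so leave the percent mapped nearly unchanged while introducing variation.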
To measure contamination in public sequence databases, we used ConFindr [13] to analyze 10,000 randomly selected fastq datasets for each of L. monocytogenes, S. enterica, and E. coli (Additional file 2: Table S4). We detected contamination in 8.92, 6.38, and 5.47% of the data, respectively (Additional file 1: Table S5). We detected between-species contamination (1.23, 0.29, and 0.15%) less often than within-species contamination (7.69, 6.09, and 5.33%), consistent with Low et al. [13]. We also analyzed the simulated data with ConFindr and used that information to estimate levels of contamination in the databases that may confound SNP and MLST workflows (Additional file 1: Figure S2 and Table S5). Approximately 1.48 (L. monocytogenes), 2.22 (S. enterica), and 0.87% (E. coli) of the data are contaminated at levels that are likely to influence SNP analyses. Roughly 2.26 (L. monocytogenes), 5.06 (S. enterica), and 1.26% (E. coli) of the data are contaminated at levels that may influence MLST analyses.
Fig. 2 Results of MLST analyses and assembly lengths for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10-50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a-c show allele counts, d-f numbers of missing and partial alleles, and g-i assembly lengths.
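To give a sense of scale, the reported detection percentages can be converted to approximate dataset counts. The sketch below assumes that each percentage is taken over all 10,000 sampled datasets per species, which the text does not state explicitly; the variable and function names are hypothetical.

```python
# Assumption: each percentage is a fraction of all 10,000 sampled datasets.
DATASETS_PER_SPECIES = 10_000

# Percent of datasets flagged by ConFindr, as reported in the text.
reported_percent = {
    "L. monocytogenes": {"any": 8.92, "between_species": 1.23, "within_species": 7.69},
    "S. enterica": {"any": 6.38, "between_species": 0.29, "within_species": 6.09},
    "E. coli": {"any": 5.47, "between_species": 0.15, "within_species": 5.33},
}


def percent_to_count(percent, total=DATASETS_PER_SPECIES):
    """Convert a percentage of the sample into a rounded dataset count."""
    return round(percent * total / 100)


approx_counts = {
    species: {kind: percent_to_count(value) for kind, value in values.items()}
    for species, values in reported_percent.items()
}

for species, counts in approx_counts.items():
    print(species, counts)
```

Under that assumption, the percentages correspond to several hundred flagged datasets per species, most of them within-species.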
In summary, we show that within-species contamination (especially by 0.5 and 5% distant genomes) causes more errors in SNP counts, allele counts, and phylogenetic analyses of bacterial genomes [17] than between-species contamination. While other workflows may not yield the exact numbers measured here, the observation that contaminant reads are mapped to references and included in contigs of the same species, resulting in errors, is likely to hold. This study also shows that contamination that may cause errors in clustering analyses is present in public sequence databases. Therefore, it is important that studies include steps to detect within-species contamination.
We identified SNP clusters that contain subject genome sequences with the NCBI's Isolates Browser. If SNP clusters had more than 20 taxa, counting the subjects and their nearest neighbors, we randomly selected subsets for further analyses. We also ensured that the subjects and nearest neighbors formed monophyletic groups in phylogenetic trees. We generated SNP matrices with the CFSAN SNP Pipeline v1.0, using the subject assembly as a reference to minimize errors [32]. Alignments of SNPs that were detected by mapping reads to the reference were phylogenetically analyzed with GARLI v2.01.1067 [33] (100 replicates, K80 and HKY). We reported supports for monophyly of subjects and nearest neighbors; if they were no longer monophyletic, we recorded a support of 0.
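The rule for recording monophyly support can be sketched as follows. This is one possible reading of the rule, not the authors' scripts: trees are simplified to sets of rooted clades (each clade a frozenset of taxon names), the function name and toy trees are hypothetical, and support is taken as the percent of bootstrap replicates that contain the subject plus nearest neighbor clade.

```python
def monophyly_support(group, best_tree_clades, replicate_clades):
    """Return bootstrap support (percent) for the monophyly of `group`.

    Assumption: a support of 0 is recorded when the group is not a clade
    in the best tree, mirroring the rule described in the text.
    """
    if not replicate_clades:
        raise ValueError("at least one bootstrap replicate is required")
    target = frozenset(group)
    if target not in best_tree_clades:
        return 0.0
    hits = sum(1 for clades in replicate_clades if target in clades)
    return 100.0 * hits / len(replicate_clades)


# Hypothetical toy data: clades recovered in the best tree and in 4 replicates.
pair = frozenset({"subject", "neighbor"})
other = frozenset({"subject", "isolate_c"})
best_tree = {pair, frozenset({"isolate_c", "isolate_d"})}
replicates = [{pair}, {pair}, {other}, {pair}]

support = monophyly_support({"subject", "neighbor"}, best_tree, replicates)
broken = monophyly_support({"subject", "neighbor"}, {other}, replicates)
```

With the 100 replicates used in the study, the same calculation would yield support in whole percentage points.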
We assembled simulated data with SPAdes v3.12.0 and measured assembly statistics with QUAST v4.5. We analyzed Listeria monocytogenes assemblies with the LmCGST core-genome multi-locus sequence typing (cgMLST) tool and Salmonella enterica assemblies with an S. enterica cgMLST tool described in Pettengill et al. [15]. We analyzed E. coli assemblies with a cgMLST developed using the same approach. Partial alleles are those loci whose lengths are less than 60% of the predicted lengths, and missing alleles are those loci that are less than 60% of predicted lengths and less than 80% identical to the reference.
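The partial and missing allele definitions can be expressed as a small classification function. This sketch follows a literal reading of the thresholds as written above (60% of predicted length, 80% identity); the function name, argument names, and the "complete" label for all remaining loci are hypothetical and not taken from the cgMLST tools.

```python
LENGTH_FRACTION_THRESHOLD = 0.60  # from the text: less than 60% of predicted length
IDENTITY_THRESHOLD = 80.0         # from the text: less than 80% identical to reference


def classify_locus(observed_length, predicted_length, percent_identity):
    """Classify a locus as 'missing', 'partial', or 'complete'.

    Literal reading of the definitions: a short locus is 'missing' when its
    identity is also below the threshold, and 'partial' otherwise.
    'complete' is a hypothetical label for every other locus.
    """
    if predicted_length <= 0:
        raise ValueError("predicted_length must be positive")
    length_fraction = observed_length / predicted_length
    if length_fraction < LENGTH_FRACTION_THRESHOLD:
        if percent_identity < IDENTITY_THRESHOLD:
            return "missing"
        return "partial"
    return "complete"


# Hypothetical loci: (observed length, predicted length, percent identity).
examples = [(500, 1000, 95.0), (300, 1000, 70.0), (990, 1000, 99.5)]
labels = [classify_locus(*locus) for locus in examples]
```

Checking the stricter "missing" condition first keeps the two categories mutually exclusive, since both share the same length criterion.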
Additional file 1: Figure S1. Phylogenetic tree of 9 Listeria monocytogenes genomes with study subject and nearest neighbor labeled. Figure S2. Results of ConFindr analysis of contamination datasets generated for this study. Table S1. Contextual information for genome sequences used for this study. Table S2. Results of SNP pipeline and core-genome multi-locus sequence typing analyses. Table S3. P-values for results of clustering analyses. Table S5. Percent of contamination detected in data from NCBI. Table S6. NCBI accession numbers for data generated during this study.