Skip to main content
Fig. 1 | Genome Biology

Fig. 1

From: Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Fig. 1

How contamination occurs and how Conterminator detects it. a DNA extraction from an organism (red) is imperfect and often introduces contamination by other species (violet). DNA sequencing then generates short reads that are assembled into longer contigs. Contaminated DNA is typically assembled into separate, small contigs, but sometimes is erroneously included in the same contigs as DNA from the source organism. Contigs may also be linked by scaffolding, which can produce scaffolds containing a mixture of different species. Final assemblies are submitted to GenBank, and higher-quality assemblies are entered in RefSeq. b Conterminator detects contamination in proteins and nucleotide sequences across kingdoms, e.g., bacterial contaminants in plant genomes. The following describes the nucleotide contamination detection workflow. (1) We take taxonomically labeled input sequences and cut them into non-overlapping segments of length 1000 and extract a subset of k-mers. (2) We group the k-mers by sorting them and compute ungapped alignments between the first and all succeeding sequences per group. (3) We extract each region of the first sequence that has an alignment to other kingdoms that is longer than 100 nucleotides with a sequence identity greater than 90 %. We perform an exhaustive alignment of the input sequence segments against the multi-kingdom regions. (4) We reconstruct contig lengths within scaffolds by searching for the scaffold breakpoints (indicated by N characters in the DNA sequence) on the left and right side from the alignment start and end position. We predict that contamination is present if an alignment hits a contig that is shorter than 20 kb that aligns to a different kingdom with an contig length longer than 20 kb

Back to article page