Contamination detection in genomic data: more is not enough

The decreasing cost of sequencing and concomitant augmentation of publicly available genomes have created an acute need for automated software to assess genomic contamination. During the last 6 years, 18 programs have been published, each with its own strengths and weaknesses. Deciding which tools to use becomes more and more difficult without an understanding of the underlying algorithms. We review these programs, benchmarking six of them, and present their main operating principles. This article is intended to guide researchers in the selection of appropriate tools for specific applications. Finally, we present future challenges in the developing field of contamination detection.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02619-9.

6. I would have liked to see or be directed to actual benchmarking results that compare these methods. A lot of the comparisons felt more like casual observations rather than claims backed by data. As a reader, I would appreciate the comparisons more if they were rooted in benchmarking results comparing these methods and if those results were clearly pointed out.
7. This paper is quite comprehensive as a review paper, and the authors have captured many of the relevant citations. However, some relevant citations are inevitably missed, which I would like to bring to your attention.
Minor (line numbers given)
55-66: Can the distinction between the two be defined more precisely? I think I understand the concepts, but sentences here were a bit hard to parse. A more precise definition would help. The Figure was of no help in distinguishing the two for me.
171: sensibility? Perhaps you mean sensitivity?
214: This sentence is not accurate. Kraken does allow inexact matching (through masking), and also, it should not be called an alignment method, in my opinion.

Reviewer 2
This is a well-written review focusing on the important issue of genome contamination. The authors' summary of the different contamination detection algorithms will be of help to the broader microbial genomics community.

Reviewers' reports
> We thank the two reviewers for their detailed proofreading. The quality and interest of our review have been much improved by their suggestions.

Reviewer #1
I read the manuscript with much anticipation because the topic is important, and the authors have noticed a real gap; that is, there are many tools for dealing with contamination but little clarity in terms of their differences and strengths. The paper is well written and helps a reader not greatly familiar with these tools (like myself) to gain some level of understanding. There is much value in having a review like this published.
High-level criticism: However, for all my enthusiasm, I came out of reading the paper with a sense that I did not learn as much as I had hoped. I give some more specific comments below. However, before doing that, I wanted to share my interpretation that the review, while quite thorough, lacks discussion of concepts and big picture debates, challenges, pitfalls, things the reader should be worried about, etc. It is excellent to go through all the algorithms, but I would have liked to see more general issues and the taxonomy of issues to consider. I understand that this comment may be hard to apply. So, here are more specific questions.
Detailed points to address
1. Detection of contamination can happen at different stages: that is, on a read-by-read basis, on contigs, scaffolds, individual genes, etc. In particular, there is a big difference between looking for contaminants pre- and post-assembly. I think this distinction is not clear. I believe tools like Kraken would be useful mostly prior to assembly (i.e., they are run
2. Detection of contaminants can happen by 1) checking a sequence against what is expected to be there (call it a positive filter) or 2) checking a sequence against possible contaminants that should not be there but may be there (call it a negative filter). I think it would be helpful to make this clear. Also, I believe most of the methodology described here is for positive filters, though some of the tools, like Kraken and Clark, can be used for negative filters. For example, for a sample supposed to be a dog, you can look for bacterial reads in the set of reads.
3. There is a potential problem with checking reads or perhaps even contigs against a database of other sequences from (supposedly) the same species: namely, the references themselves may be contaminated. Imagine you have Drosophila, and you want to find contaminant reads (or contigs) in that Drosophila genome. If you search your sequences against a library of other Drosophila genomes, some of which are contaminated with the same species (say, a common bacterium), then you may miss the contaminant. A reader would benefit from understanding this difficulty. I think the authors touch on this in passing, but a more detailed discussion would be good.
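
The distinction drawn in points 2 and 3 can be made concrete with a small sketch. It is only an illustration under stated assumptions: sequences are compared by shared k-mers, and the function names, the toy sequences, `k = 4` and the 0.5 threshold are all hypothetical. It is not how Kraken, Clark or any tool covered in the review is implemented.

```python
def kmers(seq, k=4):
    """Return the set of k-mers in seq (toy k; an assumption for illustration)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(query, reference_kmers, k=4):
    """Fraction of the query's distinct k-mers that occur in the reference set."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & reference_kmers) / len(q)


def positive_filter(query, expected_kmers, threshold=0.5, k=4):
    """Flag the query when it does NOT resemble what is expected to be there."""
    return shared_fraction(query, expected_kmers, k) < threshold


def negative_filter(query, contaminant_kmers, threshold=0.5, k=4):
    """Flag the query when it DOES resemble a possible contaminant."""
    return shared_fraction(query, contaminant_kmers, k) >= threshold


# Hypothetical toy sequences standing in for a host and a contaminant.
HOST_REF = kmers("ACGTACGTACGTACGT")
CONTAMINANT_REF = kmers("GGGCCCGGGCCCGGGCCC")
# A reference of the expected species that itself carries the contaminant.
CONTAMINATED_HOST_REF = HOST_REF | CONTAMINANT_REF

host_read = "CGTACGTA"
contaminant_read = "GGCCCGGG"

flag_clean_ref = positive_filter(contaminant_read, HOST_REF)
flag_dirty_ref = positive_filter(contaminant_read, CONTAMINATED_HOST_REF)
```

With a clean reference the positive filter flags the contaminant read, whereas with the contaminated reference the same read passes unnoticed, which is the difficulty raised in point 3. The negative filter does not depend on the host reference, but it can only find contaminants that are already in its database.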

> Again, the Reviewer is right. Contamination of public databases is an important problem, especially for algorithms making use of a reference database (the majority of the tools). In this category, Physeter is the only package able to deal with database contamination by applying a leave-one-out approach. At first, to avoid highlighting our own tool in the review, we had not included a section on database contamination in the manuscript. However, as we believe that it is an important aspect for the future of the field, we have now added a subsection under the "Future challenges" section to clearly outline the interest and limitations of Physeter's approach.
[Lines 615-638 of the track-change document.]

4. While the authors do mention eukaryotes, there seems to be a focus on methods that address bacteria. Perhaps there is a lack of methods for eukaryotes? If so, it would be good to mention this. If there are many methods for eukaryotes that are not covered here, the focus should be clarified to the reader.
7. This paper is quite comprehensive as a review paper, and the authors have captured many of the relevant citations. However, some relevant citations are inevitably missed, which I would like to bring to your attention.

Reviewer #2
This is a well-written review focusing on the important issue of genome contamination. The authors' summary of the different contamination detection algorithms will be of help to the broader microbial genomics community.

> We thank the Reviewer for this good idea. Figure 1 has been modified following this suggestion. Moreover, NCBI SRA has been added and the legend has been expanded.