Fig. 1 | Genome Biology

Fig. 1

From: Rapid and sensitive detection of genome contamination at scale with FCS-GX

Overview of the FCS-GX pipeline. FCS-GX splits genome assembly scaffolds into contigs and chunks the contigs into 100-kbp subsequences for processing. For eukaryote assemblies, FCS-GX performs repeat detection and masking. The GX aligner operates in two passes using modified k-mers (h-mers): the first pass aligns query sequences to the entire indexed reference database, and the second pass refines the alignments against only the sequences belonging to the taxid sets that provided the best matches. After collecting coverage and score information, FCS-GX identifies likely contaminant sequences by comparing the taxonomic assignment calculated for each sequence with the user-specified taxid. The final output from FCS-GX is a cleaned FASTA alongside an action report that details the contaminant cleaning actions taken (FCS-GX actions EXCLUDE, TRIM, FIX) as well as additional sequences that warrant manual review but are not automatically cleaned (FCS-GX actions REVIEW, REVIEW_RARE, INFO). See “Methods” for descriptions of FCS-GX action categories. In the cartoon example, one complete sequence and one partial sequence assigned as contaminant are removed from the input assembly to produce the final cleaned FASTA. FCS-GX uses a custom reference database totaling 709 Gbp of sequence data from assemblies and common contaminants used in current NCBI screening. Assemblies contributing to the database were screened by FCS-GX while excluding self-hits, and high-confidence contaminants were removed so that the database could be used to screen new genomes. This removal can be performed either by adding contaminated database sequence entries to a file that prevents FCS-GX from reporting alignments to those entries in subsequent runs, or by adding heavily contaminated genomes to a separate file that prevents the entire assembly from being used in future database builds.
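
The first step described in the caption (scaffolds split into contigs, contigs chunked into 100-kbp subsequences) can be illustrated with a minimal Python sketch. This is not the FCS-GX implementation. It assumes that contig boundaries are runs of N bases, which the caption does not state, and the function names split_scaffold, chunk_contig, and chunk_scaffold are hypothetical. Only the 100-kbp chunk size comes from the caption.

```python
import re
from typing import Iterator, List, Tuple

CHUNK_SIZE = 100_000  # 100 kbp, the chunk size given in the caption


def split_scaffold(seq: str) -> Iterator[Tuple[int, str]]:
    """Yield (start, contig) for each gap-free stretch of a scaffold.

    Assumption: gaps between contigs are runs of 'N' or 'n'.
    """
    for match in re.finditer(r"[^Nn]+", seq):
        yield match.start(), match.group()


def chunk_contig(start: int, contig: str,
                 size: int = CHUNK_SIZE) -> Iterator[Tuple[int, str]]:
    """Yield (scaffold_position, chunk) pieces of at most `size` bases."""
    for offset in range(0, len(contig), size):
        yield start + offset, contig[offset:offset + size]


def chunk_scaffold(seq: str, size: int = CHUNK_SIZE) -> List[Tuple[int, str]]:
    """Split a scaffold into contigs, then chunk every contig."""
    return [
        piece
        for start, contig in split_scaffold(seq)
        for piece in chunk_contig(start, contig, size)
    ]


if __name__ == "__main__":
    # Toy scaffold: a 250-kbp contig, a 100-bp gap, then a 50-kbp contig.
    scaffold = "A" * 250_000 + "N" * 100 + "C" * 50_000
    for position, chunk in chunk_scaffold(scaffold):
        print(position, len(chunk))
```

Each chunk keeps its start coordinate on the original scaffold, so that any per-chunk result could later be mapped back to a scaffold range, for example when only part of a sequence is to be trimmed.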
