Fig. 1 | Genome Biology

Fig. 1

From: Widespread false gene gains caused by duplication errors in genome assemblies

Overview of false duplication identification. a Mechanisms by which false assembly duplications are created. If haplotype phasing is included and correctly performed in the assembly process, only one allele appears in the primary assembly, with the other placed in the alternate assembly (right panel, column 1). Without proper phasing, however, both alleles of a heterozygous locus may be assembled into one scaffold (column 2) or into two different scaffolds (column 3) of the primary assembly. Alternatively, erroneous sequencing reads that pile up randomly or systematically can generate false duplications (column 4). These mechanisms lead to three types of false duplications. b Scheme to identify false duplications. Whole-genome alignment between the two assemblies using Cactus and self-alignment using purge_dups reveal candidate falsely duplicated regions or whole contigs. The union of the candidates from these two independent methods is then used to find false duplications, which show some combination of near-haploid read depth of the 10X Genomics linked reads, the presence of gaps between duplications, and discordance in read pairs between duplications. c Scheme to classify false duplication types. The copy number of k-mers is calculated from the assembly and their multiplicity from the 10X Genomics linked reads; both are used to classify false duplications as heterotype or homotype. Heterotype duplications include haploid-specific k-mers (i.e., 1-copy). Homotype duplications do not include haploid-specific k-mers but do include sequencing errors, which can be detected by read depth below the haploid level.
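
The classification rule in panel c can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy k-mer size `k=5`, the tolerance `tol` that defines "near haploid" depth, and the use of plain (non-canonical) k-mers are all assumptions made for illustration. The idea is that k-mers differing between the two copies are looked up in the read k-mer counts; if any sit near the haploid depth, the pair is labelled heterotype, and if they fall outside that band (for example far below it, as expected for sequencing errors) the pair is labelled homotype.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (uppercased, not canonicalized)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(seqs, k):
    """Count k-mer multiplicity across a collection of sequences (e.g. reads)."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))
    return counts


def classify_duplication(copy_a, copy_b, read_counts, haploid_depth, k=5, tol=0.5):
    """Label a candidate duplicated pair as 'heterotype' or 'homotype'.

    Assumption (hypothetical rule): a k-mer is 'haploid-specific' if it occurs
    in only one of the two copies and its read multiplicity lies within
    haploid_depth * (1 +/- tol).
    """
    lo = haploid_depth * (1 - tol)
    hi = haploid_depth * (1 + tol)
    # k-mers present in exactly one of the two copies
    differing = set(kmers(copy_a, k)) ^ set(kmers(copy_b, k))
    haploid_specific = [km for km in differing if lo <= read_counts.get(km, 0) <= hi]
    return "heterotype" if haploid_specific else "homotype"


# Toy data: two copies that differ at a single base.
copy_a = "ACGTACGGTA"
copy_b = "ACGTACGCTA"

# Two real haplotypes, each sequenced at haploid depth 10.
het_reads = count_kmers([copy_a] * 10 + [copy_b] * 10, k=5)

# One real sequence at diploid depth 20, plus a single erroneous read.
hom_reads = count_kmers([copy_a] * 20 + [copy_b], k=5)

het_label = classify_duplication(copy_a, copy_b, het_reads, haploid_depth=10)
hom_label = classify_duplication(copy_a, copy_b, hom_reads, haploid_depth=10)
```

In the toy data, the first case models two true alleles each covered at haploid depth, while the second models a single true sequence whose "second copy" is supported only by one erroneous read. Real data would need canonical k-mers, a realistic k, and a depth model fitted to the read set.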
