Figure 1 | Genome Biology

Figure 1

From: Detection and correction of false segmental duplications caused by genome mis-assembly

Mis-assembled DCC and DOC. Assemblers may mistakenly form two contigs from the two haplotypes, as shown in (a), where contig A contains heterozygous sequence and contig B contains homozygous sequence (light) on both sides of a matching heterozygous region (dark), with sequencing reads shown as lines above them. We refer to A as a duplicated contained contig (DCC). We can identify this situation by finding an alignment between contigs A and B that completely covers contig A, and then comparing contig A's mate pair links in the original location to those same links when contig A is overlaid on contig B at the location of its alignment, as shown in (b). Dashed curves in (a) indicate distances that are significantly shorter (left side of figure) or longer (right) than expected; solid curves indicate distances that are consistent with specifications. In the situation shown here, we would designate contig A as an erroneous duplication likely to have been caused by haplotype differences. Alternatively, heterozygous sequence may be separated into two contigs that each include some homozygous sequence on opposite ends, as in contigs C and D in (c), which we refer to as duplicated overlapping contigs (DOCs). If a significant alignment exists between the ends of these contigs and the distances between mate pairs pointing right from contig C and left from contig D better match their expected fragment sizes when the contigs are joined, we designate the region as an erroneous duplication and join the contigs as in (d).
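
The DCC test described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Library` container, the function names, the requirement that the alignment span contig A exactly, and the cutoff of `k = 3` standard deviations for a "consistent" distance are all assumptions made here. The mate-pair distances are assumed to have been computed already for both placements of contig A.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Library:
    """Expected fragment size of a mate-pair library (hypothetical container)."""
    mean: float
    sd: float


def consistent(distance: float, lib: Library, k: float = 3.0) -> bool:
    """True if a mate-pair distance lies within mean +/- k standard deviations.

    The cutoff k is an assumption; the caption only says "consistent with
    specifications".
    """
    return abs(distance - lib.mean) <= k * lib.sd


def count_consistent(distances: Sequence[float], lib: Library, k: float = 3.0) -> int:
    """Number of mate-pair distances that fall inside the accepted range."""
    return sum(consistent(d, lib, k) for d in distances)


def covers_contig(aln_start: int, aln_end: int, contig_len: int) -> bool:
    """True if the half-open alignment [aln_start, aln_end) spans all of contig A.

    Exact end-to-end coverage is a simplifying assumption of this sketch.
    """
    return aln_start == 0 and aln_end == contig_len


def looks_like_dcc(
    aln_start: int,
    aln_end: int,
    contig_len: int,
    original: Sequence[float],
    overlaid: Sequence[float],
    lib: Library,
    k: float = 3.0,
) -> bool:
    """Flag contig A as a candidate DCC.

    original: mate-pair distances with contig A in its assembled location.
    overlaid: the same links with contig A placed on contig B at its alignment.
    """
    if not covers_contig(aln_start, aln_end, contig_len):
        return False
    return count_consistent(overlaid, lib, k) > count_consistent(original, lib, k)


# Made-up numbers: links look too short or too long in the original
# placement and agree with the library once contig A is overlaid on contig B.
lib = Library(mean=3000.0, sd=300.0)
original = [1200.0, 1500.0, 5200.0, 5600.0]
overlaid = [2900.0, 3100.0, 3050.0, 2800.0]
candidate = looks_like_dcc(0, 800, 800, original, overlaid, lib)
```

The DOC case in panels (c) and (d) would follow the same pattern, comparing the links pointing right from contig C and left from contig D before and after the two contigs are joined.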
