Figure 4 | Genome Biology

Figure 4

From: Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure

Indels are enriched in repeat sequences upstream of genes. (A) Close-up of a 10 kb region of chromosome 1 containing several positions where hundreds of reads deviate from the reference in support of an indel. (B) Expected values for max-to-sum ratios of ‘reference’ and ‘indel’ reads in heterozygous and homozygous regions. (C) Scatterplot of max-to-sum ratios in heterozygous and homozygous regions for every putative indel in the genome. Histograms at top and right show the distribution of the data along each axis, as indicated. The color of each point is based on the legend, where W and C indicate reads from the Watson and Crick strands, respectively. (D) The cutoff for indel designation, indicated in red, has a 5% false discovery rate (FDR), based on fitting the sum of gamma and Gaussian distributions, which reflect the true and false indels, respectively. The histogram in green includes only points with homozygous max-to-sum ratios <1.0 and rectilinear distances of 0.6 or less from the point [1.0,0.5]. (E) Indel density as a function of indel size and distance from the start codon. Density values were normalized to account for the fact that not all coding or intergenic regions span 1,000 nucleotides. (F) Indels are strongly enriched in repeat sequences. (G) Indels are not a sequencing artifact. The average size reported by all reads supporting an indel was calculated and then compiled into a histogram representing all indels. Random sequencing errors would have yielded density at non-integer values and, more importantly, around zero.
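The filter described for panel (D) can be illustrated with a minimal sketch. The caption does not define the max-to-sum ratio or the axis order of the point [1.0,0.5], so the code below rests on two assumptions: that the ratio is the larger of the ‘reference’ and ‘indel’ read counts divided by their sum, and that the point is given as (homozygous ratio, heterozygous ratio). The function names are hypothetical and are not taken from the article.

```python
# Illustrative sketch only; definitions below are assumptions, not the authors' code.

def max_to_sum(ref_reads: int, indel_reads: int) -> float:
    """Assumed definition: larger read count divided by the total read count."""
    total = ref_reads + indel_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return max(ref_reads, indel_reads) / total


def rectilinear_distance(point, target=(1.0, 0.5)) -> float:
    """Rectilinear (Manhattan) distance between two 2-D points."""
    return abs(point[0] - target[0]) + abs(point[1] - target[1])


def in_green_histogram(hom_ratio: float, het_ratio: float,
                       max_distance: float = 0.6) -> bool:
    """Apply the two conditions stated in the caption for panel (D).

    Assumes the point [1.0, 0.5] is ordered (homozygous, heterozygous).
    """
    close_enough = rectilinear_distance((hom_ratio, het_ratio)) <= max_distance
    return hom_ratio < 1.0 and close_enough


if __name__ == "__main__":
    # Equal support for both alleles gives 0.5; support for one allele only gives 1.0.
    print(max_to_sum(50, 50), max_to_sum(0, 120))
    print(in_green_histogram(0.95, 0.55))
```

Under these assumptions, a position with evenly split reads has a ratio of 0.5 and a position where every read agrees has a ratio of 1.0, which matches the coordinates of the reference point named in the caption.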
