From: Prediction of effective genome size in metagenomic samples

Prediction error and identification of sequencing artifacts. Distribution of the prediction error ([predicted - known genome size]/known genome size) for 32 complete genome shotgun datasets downloaded from the NCBI Trace Archive (see Additional data file 9 for a list). Most predictions have an error below 20%, with a median value of about 9%. There are, however, two exceptions in which the error is substantially larger. The first is the Wolbachia endosymbiont of Drosophila melanogaster. The marker OG density in the simulated reads is considerably higher than in the real shotgun data, leading to a 70% difference in predicted genome size. Further investigation of the raw reads showed that this difference was caused by substantial contamination of the dataset with reads originating from the organism's host, Drosophila; these reads were filtered out during assembly of the genome but are still present in the shotgun data available at the Trace Archive. The second exception is the genome of the PCE-dechlorinating bacterium Dehalococcoides ethenogenes. Here too, the marker OG density in the shotgun data is lower than in the simulated dataset. Mapping the publicly available reads to the genome sequence revealed a peak of read density in a region identified as an integrated element, which is believed to exist in variable copy numbers in different individuals but was included only once in the published genome sequence [51]. OG, orthologous group.
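
The error measure defined in the caption can be illustrated with a minimal Python sketch. The genome names and sizes below are hypothetical placeholders, not values from the 32 datasets, and flagging outliers by the absolute value of the error at a 20% cutoff is an assumption made only for illustration.

```python
from statistics import median


def relative_error(predicted, known):
    """Signed prediction error: (predicted - known genome size) / known genome size."""
    if known <= 0:
        raise ValueError("known genome size must be positive")
    return (predicted - known) / known


# Hypothetical (predicted, known) genome sizes in base pairs; NOT data from the paper.
examples = {
    "genome_a": (2_200_000, 2_000_000),
    "genome_b": (4_500_000, 5_000_000),
    "genome_c": (1_700_000, 1_000_000),
}

errors = {name: relative_error(p, k) for name, (p, k) in examples.items()}

# Assumption: a dataset counts as an outlier when |error| >= 20%.
outliers = [name for name, e in errors.items() if abs(e) >= 0.20]

# Median of the absolute errors, comparable to the "about 9%" summary in the caption.
median_abs_error = median(abs(e) for e in errors.values())

for name, e in errors.items():
    print(f"{name}: {e:+.0%}")
print("outliers:", outliers)
print(f"median |error|: {median_abs_error:.0%}")
```

In this toy input, only the third genome exceeds the cutoff, which is the kind of case that would prompt a closer look at the raw reads for contamination or copy-number effects.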

