Table 2 Genome assembly continuity and correctness using hybrid and self-correction approaches

From: Reducing assembly complexity of microbial genomes with single-molecule sequencing

| Organism | Corrected by | Assembly bp | Number of contigs (expected) | Number of contigs (actual) | N50 (expected) | N50 (actual) | LAP | Number of discordant bases | QV |
|---|---|---|---|---|---|---|---|---|---|
| E. coli K12 | Reference | 4,639,675 | | 1 | 4,639,675 | NA | -9.65E+07 | 4 | >60 |
| | MiSeq 100× | 4,647,253 | 1 | 2 | | 2,367,319 | -9.64E+07 | 3 | >60 |
| | 454 50× | 4,649,004 | 1 | 1 | | 4,649,004 | -9.64E+07 | 3 | >60 |
| | CCS 25× | 4,653,267 | 1 | 1 | | 4,653,267 | -9.64E+07 | 3 | >60 |
| | Self | 4,653,486 | 1 | 1 | | 4,653,486 | -9.64E+07 | 3 | >60 |
| E. coli O157:H7 | Near neighbor | 5,594,477 | | 3 | 3,776,951 | NA | -3.82E+07 | 1,282 | 36.40 |
| | MiSeq 100× | 5,624,394 | 10 | 10 | | 3,089,011 | -3.66E+07 | 4 | >60 |
| | 454 40× | 5,613,057 | 10 | 12 | | 927,294 | -3.67E+07 | 13 | 56.35 |
| | Self | 5,611,389 | 10 | 9 | | 4,324,437 | -3.66E+07 | 0 | >60 |
| B. trehalosi | MiSeq 100× | 2,402,545 | | 6 | | 1,603,511 | -3.28E+07 | 1 | >60 |
| | 454 50× | 2,413,761 | | 4 | | 1,051,672 | -3.27E+07 | 2 | >60 |
| | CCS 25× | 2,411,501 | | 1 | | 2,411,501 | -3.27E+07 | 0 | >60 |
| | Self | 2,411,068 | | 1 | | 2,411,068 | -3.27E+07 | 0 | >60 |
| M. haemolytica | MiSeq 100× | 2,712,467 | | 1 | | 2,712,467 | -3.31E+07 | 0 | >60 |
| | CCS 25× | 2,739,949 | | 2 | | 2,686,992 | -3.31E+07 | 0 | >60 |
| | Self | 2,736,037 | | 1 | | 2,736,037 | -3.31E+07 | 0 | >60 |
| F. tularensis | Near neighbor | 1,895,727 | | 1 | 965,253 | NA | -1.33E+07 | 113 | 42.25 |
| | MiSeq 100× | 1,879,071 | 3 | 10 | | 357,518 | -1.33E+07 | 0 | >60 |
| | 454 50× | 1,863,947 | 3 | 15 | | 201,203 | -1.33E+07 | 0 | >60 |
| | Self | 1,828,135 | 3 | 8 | | 401,731 | -1.33E+07 | 0 | >60 |
| | Self (300×) | 1,877,407 | 3 | 3 | | 573,021 | -1.33E+07 | 0 | >60 |
| S. enterica Newport | Near neighbor | 5,007,719 | | 2 | 4,827,641 | NA | -2.26E+07 | 20 | 53.99 |
| | MiSeq 56× | 5,027,784 | 4 | 2 | | 4,918,796 | -2.24E+07 | 2 | >60 |
| | 454 25× | 5,034,500 | 4 | 3 | | 4,095,943 | -2.24E+07 | 2 | >60 |
| | CCS 22× | 5,030,885 | 4 | 2 | | 4,921,886 | -2.24E+07 | 2 | >60 |
| | Self | 5,029,197 | 4 | 2 | | 4,919,684 | -2.24E+07 | 2 | >60 |
Organism: the genome being assembled. Corrected by: the short-read data used for correction. Assembly bp: the total number of base pairs in all contigs (only contigs containing at least 100 reads are included in all results). Number of contigs (expected): predicted number of contigs for a known reference (or near-neighbor). Number of contigs (actual): the number of contigs comprising the assembly. N50: N such that 50% of the genome is contained in contigs of length ≥N. LAP: the assembly likelihood score. A score closer to zero indicates a better assembly. Number of discordant bases: the number of SNPs and indels identified by mapping MiSeq sequences back to the assembly and recording discrepancies. Each incorrect base is counted (that is, an indel that is a deletion of two bases from the assembly counts as two in this column). QV: estimated from the number of discordant bases as $QV = 10 \log_{10}(\text{assembly length} / \text{number of incorrect bases})$. The QV can be converted to an error probability $P = 10^{-QV/10}$. Assemblies were generated by Celera Assembler [31] followed by post-processing with Quiver [32]. NA, not available.
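
The N50 and QV definitions in the footnote can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names `n50`, `qv`, and `error_probability`, and the optional `genome_size` parameter, are hypothetical and introduced here only for illustration. Returning infinity for zero discordant bases is likewise an assumption; the table reports such cases as ">60".

```python
import math


def n50(contig_lengths, genome_size=None):
    """Return N such that contigs of length >= N hold at least 50% of the genome.

    Assumption: if genome_size is not given, the total assembly length is used.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # contigs cover less than half of the given genome size


def qv(assembly_length, discordant_bases):
    """QV = 10 * log10(assembly length / number of incorrect bases)."""
    if discordant_bases == 0:
        return math.inf  # assumption: no observed errors, reported as ">60" in the table
    return 10 * math.log10(assembly_length / discordant_bases)


def error_probability(quality_value):
    """Convert a QV to a per-base error probability, P = 10^(-QV/10)."""
    return 10 ** (-quality_value / 10)


# E. coli O157:H7 near-neighbor row: 5,594,477 bp with 1,282 discordant bases.
print(round(qv(5_594_477, 1_282), 2))
```

Applied to the near-neighbor rows of the table, `qv` reproduces the tabulated values to two decimal places (for example, 1,895,727 bp with 113 discordant bases gives about 42.25), and `error_probability(30)` corresponds to one expected error per 1,000 bases.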