Table 2 Datasets used in performance assessment

From: A comparative evaluation of hybrid error correction methods for error-prone long reads

| Datasets | Bacteria | Yeast | Fly | Plant |
| --- | --- | --- | --- | --- |
| Reference organism | | | | |
| Name | Escherichia coli | Saccharomyces cerevisiae | Drosophila melanogaster | Arabidopsis thaliana |
| Strain | K-12 substr. MG1655 | S288C | iso-1 | Ler-0 |
| Reference sequences | NC_000913 | NC_0011{33–48} | | |
| Genome size (Mbp) | 4.64 | 12.13 | 143.73 | 119.67 |
| PacBio data | | | | |
| Accession number | DevNet [53] | DevNet [54] | SRX499318 [55] | SRX533607 [55] |
| Number of reads | 55,137 | 220,947 | 6,864,972 | 7,515,360 |
| Median read length (bp) | 8473 | 5295 | 810 | 1099 |
| Coverage | 113x | 112x | 204x | 301x |
| Chemistry | P6C4 | P4C2 | P5C3 | P5C3 |
| ONT data | | | | |
| Accession number | ERR1147227, ERR1147228 [56] | ERR1883{398–402}, ERR1883389 [57] | | |
| Number of reads | 58,221 | 183,062 | | |
| Median read length (bp) | 8652 | 6427 | | |
| Coverage | 113x | 112x | | |
| Chemistry | R7.3 | R7.3/R9 | | |
| Illumina data | | | | |
| Accession number | ERR022075 [58] | SRP014568 [59] | ERX645969 [60] | SRR3166543 [45] |
| Number of reads | 45,440,200 | 28,943,170 | 179,363,706 | 324,725,120 |
| Read length (bp) | 101 | 101/152 | 101 | 100 |
Note: To maximize the quality of the tested long-read (LR) data, CCS or 2D LR data were used where available; otherwise, subreads or template LRs from the same molecules were used instead. ONT LRs were randomly subsampled to match the data size of the PacBio data. For the E. coli data, 16.29% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 70.59% of DNA molecules generated 2D LRs in the ONT dataset. For the S. cerevisiae data, 0.68% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 42.38% of DNA molecules generated 2D LRs in the ONT dataset. The D. melanogaster and A. thaliana datasets contained no CCS reads, as provided by the original authors of the resource [23].
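
The note says that ONT reads were randomly selected to match the PacBio data size, but it does not describe the procedure. The following is only a minimal sketch of one way such a selection could be done, assuming that "data size" means total bases rather than read count (the table lists equal coverage but different read counts for the two platforms, which fits that reading). The function name `subsample_to_target_bases`, its parameters, and the dictionary input are hypothetical and do not come from the paper.

```python
import random


def subsample_to_target_bases(read_lengths, target_bases, seed=0):
    """Randomly pick reads until their summed length reaches target_bases.

    read_lengths: dict mapping read ID -> read length in bp (hypothetical input).
    target_bases: total number of bases to aim for (assumed meaning of "data size").
    Returns the selected read IDs in the order they were drawn.
    """
    rng = random.Random(seed)      # fixed seed makes the selection reproducible
    ids = sorted(read_lengths)     # sort so the result depends only on the seed
    rng.shuffle(ids)

    selected = []
    total = 0
    for read_id in ids:
        if total >= target_bases:  # stop once the target has been reached
            break
        selected.append(read_id)
        total += read_lengths[read_id]
    return selected


def coverage(total_bases, genome_size_bp):
    """Sequencing depth as total sequenced bases divided by genome size."""
    return total_bases / genome_size_bp
```

With this approach the selected set overshoots the target by at most one read, so the resulting coverage is slightly above the requested value rather than exactly equal to it. A target for a given depth could be derived as `depth * genome_size_bp`, for example 113 times 4.64 Mbp for the E. coli column.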