Table 2 Datasets used in performance assessment

From: A comparative evaluation of hybrid error correction methods for error-prone long reads

| Datasets | Bacteria | Yeast | Fly | Plant |
| --- | --- | --- | --- | --- |
| Reference organism | | | | |
| Name | Escherichia coli | Saccharomyces cerevisiae | Drosophila melanogaster | Arabidopsis thaliana |
| Strain | K-12 substr. MG1655 | S288C | iso-1 | Ler-0 |
| Reference sequences | NC_000913 | NC_0011{33–48}, NC_001224 | NT_0337{77–79}, NC_0043{53–54}, NC_0245{11–12}, NT_037436 | NC_0030{70–71}, NC_0030{74–76}, NC_001284, NC_000932 |
| Genome size (Mbp) | 4.64 | 12.13 | 143.73 | 119.67 |
| PacBio data | | | | |
| Accession number | DevNet [53] | DevNet [54] | SRX499318 [55] | SRX533607 [55] |
| Number of reads | 55,137 | 220,947 | 6,864,972 | 7,515,360 |
| Median read length (bp) | 8473 | 5295 | 810 | 1099 |
| Coverage | 113x | 112x | 204x | 301x |
| Chemistry | P6C4 | P4C2 | P5C3 | P5C3 |
| ONT data | | | | |
| Accession number | ERR1147227, ERR1147228 [56] | ERR1883{398–402}, ERR1883389 [57] | | |
| Number of reads | 58,221 | 183,062 | | |
| Median read length (bp) | 8652 | 6427 | | |
| Coverage | 113x | 112x | | |
| Chemistry | R7.3 | R7.3/R9 | | |
| Illumina data | | | | |
| Accession number | ERR022075 [58] | SRP014568 [59] | ERX645969 [60] | SRR3166543 [45] |
| Number of reads | 45,440,200 | 28,943,170 | 179,363,706 | 324,725,120 |
| Read length (bp) | 101 | 101/152 | 101 | 100 |
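
The reference sequence and ONT accession entries use a brace shorthand such as NC_0011{33–48}. Read as an inclusive run of consecutive accessions, that entry gives 16 identifiers, which matches the sixteen nuclear chromosomes of S. cerevisiae. The sketch below expands the shorthand into full accessions. The inclusive-range reading is an assumption, and the function name `expand_accessions` is hypothetical, not something provided by the article or by any database client.

```python
import re

# Shorthand as it appears in the table: a prefix, then {start–end}.
# Both an en dash and a plain hyphen are accepted as the range separator.
_RANGE = re.compile(
    r"^(?P<prefix>[A-Z]+_?\d*)\{(?P<start>\d+)[–-](?P<end>\d+)\}$"
)


def expand_accessions(shorthand: str) -> list[str]:
    """Expand 'NC_0011{33–48}' into ['NC_001133', ..., 'NC_001148'].

    Entries without braces are returned unchanged as a one-item list.
    Assumes the braces denote an inclusive run of consecutive numbers.
    """
    text = shorthand.strip()
    m = _RANGE.match(text)
    if m is None:
        return [text]
    start, end = m["start"], m["end"]
    if len(start) != len(end) or int(start) > int(end):
        raise ValueError(f"malformed accession range: {text!r}")
    width = len(start)  # keep leading zeros inside the braces
    return [
        f"{m['prefix']}{n:0{width}d}"
        for n in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for entry in ["NC_000913", "NC_0011{33–48}", "NT_0337{77–79}"]:
        print(entry, "->", expand_accessions(entry))
```

The expanded identifiers can then be passed one by one to whatever download tool is in use; no particular tool is assumed here.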

Note: To maximize the quality of the tested LR data, CCS or 2D LR data were used where available; otherwise, subreads or template LRs from the same molecules were used instead. ONT LRs were randomly sampled to match the data size of the PacBio data. For the E. coli data, 16.29% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 70.59% of DNA molecules generated 2D LRs in the ONT dataset. For the S. cerevisiae data, 0.68% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 42.38% of DNA molecules generated 2D LRs in the ONT dataset. There were no CCS reads in the D. melanogaster and A. thaliana datasets, as provided by the original authors of the resource [23].
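
The note states that ONT reads were chosen at random to match the PacBio data size, but it does not describe the procedure. One plausible reading, consistent with the identical coverage values reported for the two platforms, is that reads were drawn at random until their total length reached a target number of bases. The following is a minimal sketch under that assumption. The function name `subsample_to_target_bases`, the fixed seed, and the stopping rule are all illustrative choices, not the authors' method.

```python
import random


def subsample_to_target_bases(read_lengths, target_bases, seed=0):
    """Pick reads in random order until their summed length reaches target_bases.

    read_lengths: sequence of read lengths in bases.
    Returns the indices of the chosen reads, in the order they were drawn.
    If the whole dataset is smaller than the target, every read is returned.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)

    chosen, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen


if __name__ == "__main__":
    # Hypothetical toy input: 1,000 reads with lengths between 1 and 20 kb.
    lengths = [random.Random(42 + i).randint(1_000, 20_000) for i in range(1_000)]
    picked = subsample_to_target_bases(lengths, target_bases=2_000_000, seed=1)
    print(len(picked), "reads,", sum(lengths[i] for i in picked), "bases")
```

For a real dataset the target could be set as coverage times genome size, for example 113 × 4.64 Mbp for E. coli, which is roughly 5.2 × 10^8 bases.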