A comparative evaluation of hybrid error correction methods for error-prone long reads

Table 2 Datasets used in performance assessment

Datasets	Bacteria	Yeast	Fly	Plant
Reference organism
Name	Escherichia coli	Saccharomyces cerevisiae	Drosophila melanogaster	Arabidopsis thaliana
Strain	K-12 substr. MG1655	S288C	iso-1	Ler-0
Reference sequences	NC_000913	NC_0011{33–48} NC_001224	NT_0337{77–79} NC_0043{53–54} NC_0245{11–12} NT_037436	NC_0030{70–71} NC_0030{74–76} NC_001284 NC_000932
Genome size (Mbp)	4.64	12.13	143.73	119.67
PacBio data
Accession number	DevNet [53]	DevNet [54]	SRX499318 [55]	SRX533607 [55]
Number of reads	55,137	220,947	6,864,972	7,515,360
Median read length (bp)	8,473	5,295	810	1,099
Coverage	113x	112x	204x	301x
Chemistry	P6C4	P4C2	P5C3	P5C3
ONT data
Accession number	ERR1147227, ERR1147228 [56]	ERR1883{398–402}, ERR1883389 [57]
Number of reads	58,221	183,062
Median read length (bp)	8,652	6,427
Coverage	113x	112x
Chemistry	R7.3	R7.3/R9
Illumina data
Accession number	ERR022075 [58]	SRP014568 [59]	ERX645969 [60]	SRR3166543 [45]
Number of reads	45,440,200	28,943,170	179,363,706	324,725,120
Read length (bp)	101	101/152	101	100

Note: To maximize the quality of the tested LR data, CCS or 2D LRs were used where available; otherwise, subreads or template LRs from the same molecules were used. ONT LRs were randomly subsampled to match the data size of the PacBio data. For the E. coli data, 16.29% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 70.59% of DNA molecules generated 2D LRs in the ONT dataset. For the S. cerevisiae data, 0.68% of DNA molecules had ≥ 2 CCS passes in the PacBio dataset and 42.38% of DNA molecules generated 2D LRs in the ONT dataset. The D. melanogaster and A. thaliana datasets contained no CCS reads, as provided by the original authors of the resource [23].
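Several accession entries in Table 2 use a brace shorthand such as NC_0011{33–48}. The sketch below shows one way to expand that shorthand into individual accession IDs before retrieving the data. It assumes the braces denote an inclusive run of consecutive numbers appended to the shared prefix; the function name `expand_accessions` is hypothetical and is not part of the original study.

```python
import re

# Matches the table's shorthand, e.g. "NC_0011{33–48}": a shared prefix,
# then a brace-enclosed numeric range written with an en dash or a hyphen.
_RANGE = re.compile(
    r"^(?P<prefix>[^{}]*)\{(?P<start>\d+)[–-](?P<end>\d+)\}(?P<suffix>[^{}]*)$"
)


def expand_accessions(shorthand: str) -> list[str]:
    """Expand one shorthand token into the accession IDs it is assumed to denote.

    Tokens without braces are returned unchanged as a one-element list.
    """
    match = _RANGE.match(shorthand)
    if match is None:
        return [shorthand]

    start, end = match["start"], match["end"]
    width = len(start)  # keep leading zeros, e.g. {07-11}
    if len(end) != width or int(end) < int(start):
        raise ValueError(f"malformed range in {shorthand!r}")

    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    # Yeast reference sequences as listed in Table 2.
    yeast_refs = expand_accessions("NC_0011{33–48}") + expand_accessions("NC_001224")
    print(len(yeast_refs), yeast_refs[0], yeast_refs[-1])
```

Under this reading, the yeast entry expands to 17 reference sequences (NC_001133 through NC_001148, plus NC_001224), and the ONT entry ERR1883{398–402} expands to five run accessions.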

ISSN: 1474-760X