
Table 1 Benchmarking results for PacBio HiFi data

From: phasebook: haplotype-aware de novo assembly of diploid genomes from long reads

| Dataset | Assembler | Size (Mb) | HC (%) | k-mer recovery, All (%) | k-mer recovery, Mat (%) | k-mer recovery, Pat (%) | Continuity, NG50 (bp) | Continuity, NGA50 (bp) | Continuity, Phased N50 (bp) | QV | Switch error (%) | N (%) | Dup (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC (HiFi 15x) | phasebook-hi | 11.2 | 99.4 | 99.9 | 99.9 | 99.5 | 546,476 | 539,378 | 383,554 | 46.7 | 0.04 | 0.0 | 1.19 |
| | HiCanu | 9.5 | 90.8 | 100.0 | 99.8 | 99.9 | 4,827,925 | 517,913 | 460,744 | 42.1 | 0.00 | 0.0 | 1.23 |
| | Flye | 5.1 | 56.0 | 96.0 | 77.4 | 78.1 | 138,470 | 10,861 | 58,919 | 44.0 | 2.79 | 0.0 | 0.92 |
| | Hifiasm | 9.7 | 97.3 | 100.0 | 99.9 | 100.0 | 4,878,334 | 2,205,550 | 620,366 | 59.8 | 0.01 | 0.0 | 1.21 |
| | IPA | 8.6 | 81.9 | 99.5 | 92.7 | 99.6 | 1,426,746 | 818,630 | 589,604 | 47.8 | 0.03 | 0.0 | 1.15 |
| | Wtdbg2 | 4.7 | 49.8 | 91.2 | 47.0 | 70.5 | – | – | 72,528 | 38.6 | 3.76 | 0.0 | 0.77 |
| | HapCut2 | 9.0 | 56.5 | 84.7 | 54.7 | 54.0 | 257,599 | 155,079 | 167,061 | 31.4 | 5.90 | 9.0 | 1.50 |
| | WhatsHap | 9.0 | 59.4 | 84.7 | 55.2 | 53.7 | 257,599 | 155,079 | 176,555 | 31.4 | 6.26 | 9.0 | 1.43 |
| HG00733 (Chr6) (HiFi 18x) | phasebook-hi | 378.6 | 91.2 | 99.3 | 96.7 | 96.3 | 517,478 | 491,919 | 317,664 | 47.6 | 2.08 | 0.0 | 1.53 |
| | HiCanu | 344.6 | 83.9 | 99.7 | 97.3 | 97.4 | 1,456,437 | 991,541 | 440,882 | 39.2 | 1.76 | 0.0 | 1.42 |
| | Flye | 169.0 | 55.1 | 97.9 | 53.3 | 49.3 | – | – | 70,128 | 45.0 | 11.40 | 0.0 | 0.97 |
| | Hifiasm | 341.5 | 93.5 | 99.7 | 97.1 | 96.5 | 28,008,203 | 10,513,146 | 673,184 | 46.4 | 1.81 | 0.0 | 1.12 |
| | IPA | 280.5 | 73.3 | 99.2 | 82.8 | 86.0 | 1,612,661 | 722,645 | 460,373 | 41.7 | 1.95 | 0.0 | 1.21 |
| | Wtdbg2 | 167.3 | 54.8 | 97.3 | 49.4 | 47.3 | – | – | 72,850 | 40.6 | 16.09 | 0.0 | 0.97 |
| | HapCut2 | 340.8 | 76.5 | 99.3 | 91.7 | 91.1 | 379,321 | 354,305 | 330,339 | 40.8 | 6.48 | 0.4 | 1.3 |
| | WhatsHap | 340.9 | 76.5 | 99.3 | 91.7 | 91.1 | 381,196 | 359,841 | 327,329 | 40.9 | 6.52 | 0.4 | 1.3 |
| HG002 (HiFi 14x) | phasebook-hi | 6709 | – | 97.5 | 80.0 | 85.0 | 136,140 | – | 111,668 | 50.5 | 0.33 | 0.0 | – |
| | HiCanu | 2953 | – | 97.1 | 49.8 | 54.0 | – | – | 937,018 | 57.9 | 0.15 | 0.0 | – |
| | Falcon | 2955 | – | 97.3 | 49.5 | 62.2 | – | – | 501,274 | 49.1 | 0.40 | 0.0 | – |
| | Hifiasm | 3067 | – | 97.5 | 49.6 | 65.3 | – | – | 1,146,665 | 54.0 | 0.11 | 0.0 | – |
| | HapCut2 | 6435 | – | 98.8 | 87.8 | 90.2 | 145,138,636 | – | 858,407 | 40.0 | 1.64 | 5.4 | – |
| | WhatsHap | 6435 | – | 98.8 | 87.8 | 90.2 | 145,138,636 | – | 858,156 | 40.0 | 1.64 | 5.4 | – |

  1. The first column gives the sequencing technology and the average sequencing coverage per haplotype. The MHC dataset is simulated, whereas the others are real. Size (Mb) is the total size of the assembly generated by each assembler. HC (%) is the haplotype coverage. In the k-mer recovery (%) columns, All is the k-mer completeness for both haplotypes combined, Mat is the maternal hap-mer completeness, and Pat is the paternal hap-mer completeness. NG50 and NGA50 are calculated against the diploid genome size (twice the haploid genome size); the haploid genome sizes of MHC, HG00733 (Chr6), and HG002 are 4.7 Mb, 171 Mb, and 3.1 Gb, respectively. Because no high-quality phased assembly is available as ground truth for HG002, haplotype coverage and NGA50 are not provided for that dataset. N (%) is the proportion of ambiguous bases. Dup (%) is the duplication ratio. The assemblers set in italics (HapCut2 and WhatsHap) are reference-guided methods, whereas the others are de novo assembly methods. For the MHC and HG00733 (Chr6) datasets, we compared with IPA, the official PacBio assembler for HiFi reads, instead of Falcon. For HG002, the publicly released assemblies (Canu, Falcon, Hifiasm) were used directly for comparison
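
The note on NG50 being computed against a diploid genome size can be illustrated with a minimal Python sketch. It uses the standard NG50 definition (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the reference genome size) and is not the evaluation code behind the table. The function name `ng50`, its parameters, and the toy contig lengths are assumptions made for illustration only.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int],
         haploid_genome_size: int,
         ploidy: int = 2) -> Optional[int]:
    """Return NG50 relative to ploidy * haploid_genome_size.

    Hypothetical helper: with ploidy=2 the reference size is the diploid
    genome size, so the 50% threshold equals one haploid genome size.
    Returns None when the assembly is too small to reach the threshold.
    """
    # Half of the (diploid) genome size is the cumulative length to reach.
    threshold = ploidy * haploid_genome_size / 2
    cumulative = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= threshold:
            return length
    return None


# Toy example (made-up lengths, not data from the table).
toy_contigs = [5, 4, 3, 2, 1]
print(ng50(toy_contigs, haploid_genome_size=6))   # diploid size 12, threshold 6
print(ng50(toy_contigs, haploid_genome_size=20))  # assembly too small
```

Under this definition, an assembly whose total length is below one haploid genome size has no NG50 against the diploid size, and the sketch returns `None` in that case. This may be why some collapsed assemblies in the table show a dash in the NG50 column, although the source does not state this explicitly.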