Table 2 Benchmarking results for PacBio CLR data

From: phasebook: haplotype-aware de novo assembly of diploid genomes from long reads

| Dataset | Assembler | Size (Mb) | HC (%) | k-mer recovery, All (%) | k-mer recovery, Mat (%) | k-mer recovery, Pat (%) | NG50 (bp) | NGA50 (bp) | Phased N50 (bp) | QV | Switch error (%) | N (%) | Dup (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC (CLR 25x) | phasebook | 10.8 | 95.2 | 96.7 | 88.0 | 76.3 | 172,577 | 172,577 | 133,141 | 37.7 | 0.66 | 0.0 | 1.36 |
| | phasebook-hi | 14.3 | 98.0 | 97.5 | 89.4 | 88.8 | 361,721 | 354,437 | 122,594 | 40.6 | 6.26 | 0.0 | 1.88 |
| | Canu | 5.9 | 59.2 | 96.3 | 80.7 | 76.9 | 2,184,005 | 62,395 | 65,217 | 39.4 | 4.57 | 0.0 | 0.92 |
| | Falcon | 5.4 | 60.5 | 94.3 | 82.2 | 68.8 | 4,814,264 | 32,719 | 120,818 | 27.6 | 5.24 | 0.0 | 1.17 |
| | Flye | 5.0 | 74.1 | 94.2 | 64.2 | 70.5 | 548,628 | 66,242 | 74,992 | 37.1 | 5.34 | 0.0 | 1.01 |
| | Wtdbg2 | 4.7 | 58.9 | 90.5 | 46.9 | 63.2 | – | – | 102,431 | 33.0 | 5.49 | 0.0 | 0.93 |
| | HapCut2 | 9.1 | 56.4 | 84.2 | 53.7 | 53.0 | 393,164 | 254,386 | 282,817 | 31.3 | 5.93 | 8.9 | 1.52 |
| | WhatsHap | 9.2 | 56.6 | 84.2 | 53.4 | 53.2 | 393,164 | 254,386 | 279,136 | 31.2 | 5.62 | 8.8 | 1.52 |
| HG00733 (Chr6) (CLR 44x) | phasebook | 453.2 | 92.9 | 98.7 | 89.7 | 90.6 | 256,934 | 253,785 | 164,373 | 32.6 | 5.50 | 0.0 | 1.92 |
| | phasebook-hi | 291.0 | 81.0 | 97.9 | 68.2 | 65.3 | 587,151 | 552,300 | 201,382 | 33.9 | 14.41 | 0.0 | 1.33 |
| | Canu | 178.0 | 56.8 | 97.9 | 52.7 | 52.9 | 110,328 | – | 119,821 | 38.8 | 17.57 | 0.0 | 0.98 |
| | Falcon | 185.6 | 63.4 | 95.1 | 41.8 | 42.0 | 155,444 | 132,950 | 142,167 | 28.6 | 22.22 | 0.0 | 1.04 |
| | Flye | 168.1 | 51.9 | 97.6 | 25.4 | 75.8 | – | – | 2,094,032 | 42.9 | 3.12 | 0.0 | 0.99 |
| | Wtdbg2 | 165.2 | 64.6 | 88.9 | 35.3 | 35.3 | – | – | 142,300 | 24.2 | 20.07 | 0.0 | 1.0 |
| | HapCut2 | 341.3 | 61.8 | 99.3 | 92.1 | 91.7 | 3,899,799 | 1,944,878 | 1,346,888 | 41.0 | 5.65 | 0.4 | 1.57 |
| | WhatsHap | 341.3 | 63.4 | 99.3 | 92.1 | 91.6 | 3,349,274 | 1,944,877 | 1,334,202 | 40.9 | 5.72 | 0.4 | 1.53 |
| HG002 (CLR 25x) | phasebook | 5829 | – | 92.3 | 60.1 | 59.2 | 96,740 | – | 70,473 | 31.9 | 1.61 | 0.0 | – |
| | phasebook-hi | 6590 | – | 97.1 | 62.0 | 70.6 | 312,775 | – | 150,123 | 35.2 | 7.16 | 0.0 | – |
| | Canu | 3119 | – | 97.1 | 49.5 | 62.3 | 47,412 | – | 207,853 | 40.0 | 6.49 | 0.0 | – |
| | HapCut2 | 6435 | – | 98.8 | 87.9 | 90.2 | 145,138,636 | – | 1,756,246 | 40.1 | 1.52 | 5.4 | – |
| | WhatsHap | 6435 | – | 98.8 | 87.9 | 90.2 | 145,138,636 | – | 1,729,580 | 40.1 | 1.52 | 5.4 | – |
| A. thaliana (CLR 75x) | phasebook | 301 | – | 89.7 | 76.6 | 76.0 | 66,120 | – | 39,513 | 27.0 | 2.78 | 0.0 | – |
| | phasebook-hi | 296 | – | 94.9 | 89.5 | 88.8 | 301,078 | – | 149,964 | 33.4 | 4.25 | 0.0 | – |
| | Canu | 238 | – | 94.4 | 86.8 | 86.6 | 204,191 | – | 64,998 | 31.9 | 4.88 | 0.0 | – |
| | Flye | 142 | – | 87.1 | 62.8 | 62.5 | 35,796 | – | 30,209 | 30.0 | 7.54 | 0.0 | – |
| | Wtdbg2 | 125 | – | 35.1 | 26.4 | 26.3 | 11,374 | – | 35,733 | 14.4 | 15.17 | 0.0 | – |
| | HapCut2 | 238 | – | 91.3 | 99.4 | 53.3 | 18,585,056 | – | 531,267 | 36.4 | 5.73 | 0.2 | – |
| | WhatsHap | 238 | – | 91.3 | 99.5 | 53.3 | 18,585,056 | – | 1,637,965 | 36.3 | 5.72 | 0.2 | – |

  1. Haplotype coverage (HC) and NGA50 are not reported for HG002 and A. thaliana because no high-quality phased assemblies are available to serve as ground truth. Falcon, Flye, and Wtdbg2 could not be run on the HG002 (CLR) data on a machine with 48 cores and 1 TB RAM, probably because they ran out of memory. phasebook-hi denotes a two-step pipeline: Canu's error-correction and trimming modules are applied to the raw noisy long reads, and the corrected reads are then assembled with phasebook in HiFi mode.
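
As a reading aid for the Continuity and QV columns, here is a minimal Python sketch of the two conventional definitions. It assumes that NG50 is the usual genome-size-normalised N50 and that QV is a Phred-scaled consensus quality; the table itself does not state how either was computed. The function names and the toy contig lengths are illustrative and do not come from the source.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Conventional NG50: walk contigs from longest to shortest and return
    the length of the contig at which the running total first reaches half
    of the (estimated) genome size."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    # The contigs sum to less than half the genome size: NG50 is undefined.
    return None


def qv_to_error_rate(qv: float) -> float:
    """Phred scale: QV = -10 * log10(per-base error rate)."""
    return 10 ** (-qv / 10)


# Toy contig lengths (made up for illustration, not taken from the table).
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=120))   # 50 + 40 = 90 >= 60, so NG50 is 40
print(ng50(toy_contigs, genome_size=400))   # total 150 < 200, so None
print(qv_to_error_rate(30))                 # QV 30 is one error per 1,000 bases
```

Under the Phred assumption, a QV of 37.7 corresponds to roughly 1.7 errors per 10,000 bases, and each 10-point increase in QV is a tenfold drop in the error rate.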