
Table 2 Overview of the reference data sets

From: Benchmarking of alignment-free sequence comparison methods

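| Category | Name | # Sequences | Average sequence length | # Files | # Sequence comparisons |
|---|---|---|---|---|---|
| Regulatory element detection | Cis-regulatory modules (CRMs) [6] | 370 | 764 nt | 370 | 68,256 |
| Protein sequence classification | Low sequence identity (< 40%) [57] | 1,066 | 180 aa | 1,066 | 567,645 |
| | High sequence identity (≥ 40%) [57] | 2,128 | 184 aa | 2,128 | 2,263,128 |
| Gene tree inference | SwissTree [58] | 651 | 398 aa | 651 | 211,575 |
| Genome-based phylogeny | Assembled genomes: 29 E. coli/Shigella strains | 29 | 4,895,247 nt | 29 | 406 |
| | Assembled genomes: 14 plant species | 14 | 337,515,688 nt | 14 | 91 |
| | Assembled genomes: 25 fish mitochondrial genomes [59] | 25 | 16,623 nt | 25 | 300 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 0.03125 | 29,557 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 0.0625 | 59,116 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 0.125 | 118,266 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 0.25 | 236,541 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 0.5 | 473,081 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 1 | 946,169 | 150 nt | 29 | 406 |
| | Unassembled genomes: 29 E. coli/Shigella strains, coverage 5 | 4,730,778 | 150 nt | 29 | 406 |
| | Unassembled genomes: 14 plant species, coverage 0.015625 | 48,274 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 0.03125 | 96,489 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 0.0625 | 1,931,268 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 0.125 | 3,862,905 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 0.25 | 7,725,928 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 0.5 | 15,461,718 | 150 nt | 14 | 91 |
| | Unassembled genomes: 14 plant species, coverage 1 | 30,903,727 | 150 nt | 14 | 91 |
| Horizontal gene transfer | 27 E. coli/Shigella genomes [60] | 27 | 4,905,896 nt | 27 | 351 |
| | 8 Yersinia species [61] | 8 | 4,605,553 nt | 8 | 28 |
| | 33 simulated genomes [62], HGT level 0 | 33 | 2,205,524 nt | 33 | 528 |
| | 33 simulated genomes [62], HGT level 250 | 33 | 2,149,620 nt | 33 | 528 |
| | 33 simulated genomes [62], HGT level 500 | 33 | 2,230,317 nt | 33 | 528 |
| | 33 simulated genomes [62], HGT level 750 | 33 | 2,263,926 nt | 33 | 528 |
| | 33 simulated genomes [62], HGT level 1,000 | 33 | 2,238,661 nt | 33 | 528 |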

  1. An interactive visualization of all results for all data sets can be found online (http://afproject.org)
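
The table does not state how its columns relate to one another, so the following is only a sketch of two patterns that can be read off the numbers. First, for most rows the number of sequence comparisons equals the number of unordered pairs of files, n(n − 1)/2; the CRM row (68,256) does not follow this pattern and is presumably counted differently. Second, assuming that "# Sequences" for the unassembled data sets is the total number of reads across all files and that 150 nt is the read length, the stated coverage of the E. coli/Shigella rows is roughly reads × read length divided by the combined size of the assembled genomes. The function names `pairwise_comparisons` and `approx_coverage` are hypothetical and are not taken from the source.

```python
def pairwise_comparisons(n_files: int) -> int:
    """Number of unordered pairs among n_files items: n * (n - 1) / 2."""
    return n_files * (n_files - 1) // 2


def approx_coverage(n_reads: int, read_length: int,
                    n_genomes: int, avg_genome_length: int) -> float:
    """Approximate sequencing coverage under the assumptions stated above:
    total sequenced bases divided by the total size of all genomes."""
    total_read_bases = n_reads * read_length
    total_genome_bases = n_genomes * avg_genome_length
    return total_read_bases / total_genome_bases


if __name__ == "__main__":
    # Rows from the table: (number of files, reported comparisons)
    rows = [(1066, 567_645), (2128, 2_263_128), (651, 211_575),
            (29, 406), (14, 91), (25, 300), (27, 351), (8, 28), (33, 528)]
    for n_files, reported in rows:
        print(n_files, pairwise_comparisons(n_files), reported)

    # E. coli/Shigella reads at stated coverage 1 and 5 (values from the table)
    print(approx_coverage(946_169, 150, 29, 4_895_247))
    print(approx_coverage(4_730_778, 150, 29, 4_895_247))
```

Under these assumptions the two coverage values come out close to 1 and 5, which matches the row labels; the check is a plausibility test on the table, not a description of how the reads were generated.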