Table 1 Summary of our views regarding ‘how essential’ each principle is for a truly excellent benchmark, along with examples of key tradeoffs and potential pitfalls relating to each principle

From: Essential guidelines for computational method benchmarking

Principles are numbered as in Fig. 1.

1. Defining the purpose and scope
   How essential?^a: +++
   Tradeoffs: How comprehensive the benchmark should be
   Potential pitfalls:
   - Scope too broad: too much work given available resources
   - Scope too narrow: unrepresentative and possibly misleading results

2. Selection of methods
   How essential?^a: +++
   Tradeoffs: Number of methods to include
   Potential pitfalls:
   - Excluding key methods

3. Selection (or design) of datasets
   How essential?^a: +++
   Tradeoffs: Number and types of datasets to include
   Potential pitfalls:
   - Subjectivity in the choice of datasets: e.g., selecting datasets that are unrepresentative of real-world applications
   - Too few datasets or simulation scenarios
   - Overly simplistic simulations

4. Parameters and software versions
   How essential?^a: ++
   Tradeoffs: Amount of parameter tuning
   Potential pitfalls:
   - Extensive parameter tuning for some methods while using default parameters for others (e.g., competing methods)

5. Evaluation criteria: key quantitative performance metrics
   How essential?^a: +++
   Tradeoffs: Number and types of performance metrics
   Potential pitfalls:
   - Subjectivity in the choice of metrics: e.g., selecting metrics that do not translate to real-world performance
   - Metrics that give over-optimistic estimates of performance
   - Methods may not be directly comparable according to individual metrics (e.g., if methods are designed for different tasks)

6. Evaluation criteria: secondary measures
   How essential?^a: ++
   Tradeoffs: Number and types of performance metrics
   Potential pitfalls:
   - Subjectivity of qualitative measures such as user-friendliness, installation procedures, and documentation quality
   - Subjectivity in the relative weighting of multiple metrics
   - Measures such as runtime and scalability depend on processor speed and memory

7. Interpretation, guidelines, and recommendations
   How essential?^a: ++
   Tradeoffs: Generality versus specificity of recommendations
   Potential pitfalls:
   - Performance differences between top-ranked methods may be minor
   - Different readers may be interested in different aspects of performance

8. Publication and reporting of results
   How essential?^a: +
   Tradeoffs: Amount of resources to dedicate to building online resources
   Potential pitfalls:
   - Online resources may not be accessible (or may no longer run) several years later

9. Enabling future extensions
   How essential?^a: ++
   Tradeoffs: Amount of resources to dedicate to ensuring extensibility
   Potential pitfalls:
   - Selection of methods or datasets for future extensions may be unrepresentative (e.g., due to requests from method authors)

10. Reproducible research best practices
   How essential?^a: ++
   Tradeoffs: Amount of resources to dedicate to reproducibility
   Potential pitfalls:
   - Some tools may not be compatible or accessible several years later

^a The higher the number of plus signs, the more central the principle is to the evaluation.
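As a concrete illustration of principles 5 and 6, the sketch below computes two quantitative performance metrics and records runtime together with the hardware it was measured on, since runtime and scalability depend on processor speed and memory. This is a minimal sketch in Python, not code from the paper: `run_method`, the toy dataset, and the choice of sensitivity and false discovery rate as metrics are assumptions standing in for a real benchmarked method and evaluation.

```python
# Minimal sketch: quantitative metrics (principle 5) plus runtime recorded
# with its hardware context (principle 6). `run_method` and the toy dataset
# are placeholders, not part of the original guidelines.
import os
import platform
import time


def confusion_counts(truth, pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(truth, pred):
    """Return two common quantitative metrics: sensitivity (TPR) and FDR."""
    tp, fp, fn, _ = confusion_counts(truth, pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return {"sensitivity": sensitivity, "FDR": fdr}


def run_method(dataset):
    """Placeholder for a benchmarked method; returns binary predictions."""
    return [1 if score > 0.5 else 0 for score in dataset["scores"]]


if __name__ == "__main__":
    dataset = {"scores": [0.9, 0.2, 0.7, 0.4], "truth": [1, 0, 0, 1]}

    start = time.perf_counter()
    predictions = run_method(dataset)
    runtime = time.perf_counter() - start

    results = evaluate(dataset["truth"], predictions)
    # Runtime and scalability depend on processor speed and memory, so record
    # the execution environment alongside the timing measurement.
    results["runtime_s"] = runtime
    results["hardware"] = f"{platform.platform()}, {os.cpu_count()} CPUs"
    print(results)
```

Reporting the hardware description alongside the timing keeps secondary measures such as runtime interpretable when the benchmark is rerun on different machines.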