
Table 1 Summary of our views regarding ‘how essential’ each principle is for a truly excellent benchmark, along with examples of key tradeoffs and potential pitfalls relating to each principle

From: Essential guidelines for computational method benchmarking

| Principle (see Fig. 1) | How essential?^a | Tradeoffs | Potential pitfalls |
|---|---|---|---|
| 1. Defining the purpose and scope | +++ | How comprehensive the benchmark should be | Scope too broad: too much work given available resources. Scope too narrow: unrepresentative and possibly misleading results |
| 2. Selection of methods | +++ | Number of methods to include | Excluding key methods |
| 3. Selection (or design) of datasets | +++ | Number and types of datasets to include | Subjectivity in the choice of datasets, e.g., selecting datasets that are unrepresentative of real-world applications. Too few datasets or simulation scenarios. Overly simplistic simulations |
| 4. Parameter settings and software versions | ++ | Amount of parameter tuning | Extensive parameter tuning for some methods while using default parameters for others (e.g., competing methods) |
| 5. Evaluation criteria: key quantitative performance metrics | +++ | Number and types of performance metrics | Subjectivity in the choice of metrics, e.g., selecting metrics that do not translate to real-world performance. Metrics that give over-optimistic estimates of performance. Methods may not be directly comparable on individual metrics (e.g., if methods are designed for different tasks) |
| 6. Evaluation criteria: secondary measures | ++ | Number and types of performance metrics | Subjectivity of qualitative measures such as user-friendliness, installation procedures, and documentation quality. Subjectivity in the relative weighting of multiple metrics (see the sketch below this table). Measures such as runtime and scalability depend on processor speed and memory |
| 7. Interpretation, guidelines, and recommendations | ++ | Generality versus specificity of recommendations | Performance differences between top-ranked methods may be minor. Different readers may be interested in different aspects of performance |
| 8. Publication and reporting of results | + | Amount of resources to dedicate to building online resources | Online resources may not be accessible (or may no longer run) several years later |
| 9. Enabling future extensions | ++ | Amount of resources to dedicate to ensuring extensibility | Selection of methods or datasets for future extensions may be unrepresentative (e.g., due to requests from method authors) |
| 10. Reproducible research best practices | ++ | Amount of resources to dedicate to reproducibility | Some tools may not be compatible or accessible several years later |

^a The higher the number of plus signs, the more central the principle is to the evaluation
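The following is a minimal Python sketch, not part of the original article, illustrating the pitfall noted under principle 6: when several performance metrics are combined into a single overall ranking, the relative weighting between them is a subjective choice, and different but equally defensible weightings can reverse which method comes out on top. The method names, metric names, and scores are hypothetical and chosen only to make the effect visible.

```python
# Hypothetical per-method scores on two normalized, higher-is-better metrics.
scores = {
    "method_A": {"accuracy": 0.90, "runtime_score": 0.40},
    "method_B": {"accuracy": 0.70, "runtime_score": 0.95},
}

def rank(scores, weights):
    """Rank methods by a weighted sum of their metric scores (best first)."""
    combined = {
        name: sum(weights[metric] * value for metric, value in metrics.items())
        for name, metrics in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)

# Emphasizing accuracy favors method_A; emphasizing runtime favors method_B.
print(rank(scores, {"accuracy": 0.8, "runtime_score": 0.2}))  # ['method_A', 'method_B']
print(rank(scores, {"accuracy": 0.2, "runtime_score": 0.8}))  # ['method_B', 'method_A']
```

One way a benchmark can limit this subjectivity is to report each metric separately (or to show rankings under several weighting schemes) rather than committing to a single composite score.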