Table: Essential guidelines for computational method benchmarking
Principle (see Fig. 1) | How essential? (+ = least, +++ = most) | Tradeoffs | Potential pitfalls |
---|---|---|---|
1. Defining the purpose and scope | +++ | How comprehensive the benchmark should be | Scope too broad: too much work for the available resources. Scope too narrow: unrepresentative and possibly misleading results |
2. Selection of methods | +++ | Number of methods to include | Excluding key methods |
3. Selection (or design) of datasets | +++ | Number and types of datasets to include | Subjectivity in the choice of datasets, e.g., selecting datasets that are unrepresentative of real-world applications. Too few datasets or simulation scenarios. Overly simplistic simulations |
4. Parameter and software versions | ++ | Amount of parameter tuning | Extensive parameter tuning for some methods while using default parameters for others (e.g., competing methods) |
5. Evaluation criteria: key quantitative performance metrics | +++ | Number and types of performance metrics | Subjectivity in the choice of metrics, e.g., selecting metrics that do not translate to real-world performance. Metrics that give over-optimistic estimates of performance. Methods may not be directly comparable according to individual metrics, e.g., if they are designed for different tasks. (See the first code sketch after the table.) |
6. Evaluation criteria: secondary measures | ++ | Number and types of secondary measures | Subjectivity of qualitative measures such as user-friendliness, installation procedures, and documentation quality. Subjectivity in the relative weighting of multiple metrics. Measures such as runtime and scalability depend on processor speed and memory |
7. Interpretation, guidelines, and recommendations | ++ | Generality versus specificity of recommendations | Performance differences between top-ranked methods may be minor. Different readers may be interested in different aspects of performance |
8. Publication and reporting of results | + | Amount of resources to dedicate to building online resources | Online resources may not be accessible (or may no longer run) several years later |
9. Enabling future extensions | ++ | Amount of resources to dedicate to ensuring extensibility | Selection of methods or datasets for future extensions may be unrepresentative (e.g., due to requests from method authors) |
10. Reproducible research best practices | ++ | Amount of resources to dedicate to reproducibility | Some tools may not be compatible or accessible several years later. (See the second code sketch after the table.) |
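To make the metric and runtime rows (principles 5 and 6) concrete, the first sketch below shows a minimal benchmarking loop in Python. It is illustrative only: the guidelines do not prescribe an implementation, and the names `benchmark`, `methods`, and `datasets` are hypothetical. F1 score stands in for a key quantitative metric (appropriate for a binary classification task), and wall-clock runtime for a secondary measure, which, as noted in the table, depends on the host machine.

```python
import time
from typing import Callable, Dict, List, Tuple

def f1_score(true_labels: List[int], pred_labels: List[int]) -> float:
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def benchmark(methods: Dict[str, Callable],
              datasets: Dict[str, Tuple[list, List[int]]]) -> List[dict]:
    """Run every method on every dataset, recording one key metric (F1)
    and one secondary measure (wall-clock runtime)."""
    results = []
    for ds_name, (features, truth) in datasets.items():
        for m_name, method in methods.items():
            start = time.perf_counter()
            predictions = method(features)  # hypothetical: each method is a callable
            runtime = time.perf_counter() - start
            results.append({
                "dataset": ds_name,
                "method": m_name,
                "f1": f1_score(truth, predictions),
                "runtime_s": runtime,
            })
    return results
```

Recording results in a long format (one row per method and dataset) also makes it straightforward to summarize performance per dataset and to add new methods or datasets later (principle 9).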
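Principles 4 and 10 both depend on knowing exactly which software versions produced the results. One minimal way to support this, again only a sketch rather than something the guidelines mandate, is to write the environment alongside the benchmark output; `record_environment` and its arguments are hypothetical names.

```python
import importlib.metadata
import json
import platform
import sys

def record_environment(package_names, path="environment.json"):
    """Dump Python, platform, and package versions next to the benchmark
    results so the run can be reproduced (or its drift diagnosed) later."""
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {name: importlib.metadata.version(name)
                     for name in package_names},
    }
    with open(path, "w") as fh:
        json.dump(env, fh, indent=2)

# Example: record_environment(["numpy", "scikit-learn"])
```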