We surveyed papers presenting new methods and checked whether these new methods were assessed as better than existing competitors. We then investigated whether the same pairwise comparisons, when performed in later (presumably unbiased) studies, again evaluated the new method as better than the old. This principle is schematically illustrated in Fig. 1. Imagine that a method “B” is introduced in the literature some time after a method “A” had been introduced. In the paper presenting method B, the authors compare it to method A (and typically find that B is better than A). This is what we call a non-neutral comparison, because the authors have an interest in demonstrating the superiority of method B. Some time later, a paper suggesting another method, “C”, is published. Its authors compare C to the existing methods A and B, which implies that this study also compares B to A, although this latter comparison is not the paper’s focus. This study is assumed to be neutral with respect to the comparison of A and B, as is the later study labelled “neutral benchmark study” in Fig. 1.
In the simplified scenario of Fig. 1, we have one non-neutral study and two neutral studies, all comparing A and B. In the terminology commonly used in the debate on the replication crisis, the former could be denoted as an “original study” and the latter two as “replication studies”. Of course, in general, there may be any number of neutral studies, not just two. Most importantly, in our survey we did not focus on a specific pair of methods, but looked for all pairs that were compared both in the non-neutral study introducing the newer method and in at least one subsequent neutral study.
Note that each pair of methods thus acts as its own control, helping to avoid problems due to confounding, such as chronological time effects (if scientific progress works as it should, new methods are expected to be on average better than older ones, although this is not necessarily the case in practice [23]).
We identified 27 relevant studies (see supplement for details). We extracted pairs of methods that were compared in the non-neutral study introducing the newer of the two and in at least one subsequent study that was neutral with respect to this pair (i.e., a paper introducing a third method or a neutral study not introducing a new method).
Some of the papers present several substudies (typically investigating different aspects of the methods), each comparing all or a subset of the methods considered in the paper. In the first analysis, we focused on the substudies that compare all the methods considered in the paper, i.e., we excluded the “partial substudies”. With this strategy, we found 19 pairs of methods compared both in the paper introducing the newer of the two and in at least one subsequent paper that was neutral with respect to this pair. For each pair and each paper, we recorded whether the newer method was ranked better than the older one. In the second analysis, we considered substudies as independent studies, i.e., we ignored that they come in “clusters” (the papers they belong to) and did not exclude the partial substudies. This yielded a total of 28 pairs. For each pair and each substudy, we again recorded whether the newer method was ranked better than the older one.
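To make the extraction scheme concrete, the following sketch shows one possible way of organizing the recorded comparisons in R; the object and column names (comparisons, pair, source, neutral, outcome) and the example rows are purely hypothetical and do not reproduce the actual data provided in the supplement.

    # Hypothetical layout of the extracted comparisons (illustration only;
    # not the actual data from the supplement). One row per pair of methods
    # and per (sub)study in which that pair is compared.
    comparisons <- data.frame(
      pair    = c("B_vs_A", "B_vs_A", "B_vs_A"),        # newer method vs. older method
      source  = c("paper_B", "paper_C", "neutral_1"),   # paper or substudy the comparison comes from
      neutral = c(FALSE, TRUE, TRUE),                   # is the source neutral w.r.t. this pair?
      outcome = c("better", "better", "tie")            # newer method ranked better, worse, or tied
    )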
The supplement provides full details on the analysis methods, as well as the data and R code used to generate the results.
The results are displayed in Fig. 2. In the first analysis (considering papers as the unit), the new method was ranked better than the older one in 94.7% (18 of 19) of the comparisons from non-neutral papers introducing the new method, a very high rate similar to that reported in a previous survey of the statistical and bioinformatics literature [9]. In neutral comparisons (subsequent papers that were neutral with respect to the considered pair), these same new methods were ranked better than their older competitors in 64.3% of cases (49.5 of 77 neutral comparisons, where pairs with equal performance count as 0.5). This rate lies between the rate observed for non-neutral papers and the rate of 50% expected for a method that performs equally well as its competitors. This finding suggests a noteworthy optimistic bias in favor of new methods in the papers introducing them, but also the realization of scientific progress, i.e., newer methods are on average superior to older ones.
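As a minimal sketch building on the hypothetical coding above, each reported rate amounts to counting the comparisons in which the newer method was ranked better and adding 0.5 for each tie; the hypothetical helper win_rate below illustrates this arithmetic, and the fractions reproduce the figures quoted in the text.

    # Rate at which the newer method is ranked better than the older one,
    # counting ties as 0.5 (sketch based on the hypothetical coding above;
    # not the actual supplement code).
    win_rate <- function(outcome) {
      (sum(outcome == "better") + 0.5 * sum(outcome == "tie")) / length(outcome)
    }

    # The rates reported in the text reduce to fractions of this form, e.g.:
    18   / 19   # 0.947 -- non-neutral papers, first analysis
    49.5 / 77   # 0.643 -- neutral comparisons, first analysis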
The second analysis (considering substudies as the unit) shows the same trends. The newer method was ranked better than the older one in 83.2% (136.5 of 164) of the comparisons from non-neutral substudies, revealing that even the (potentially biased) authors themselves do not find their method superior in every situation. This rate is again higher than the 61.2% (408.5 of 667) observed for neutral substudies, once more suggesting optimistic bias. It is, however, much lower than the 94.7% observed in the first analysis, which is in agreement with Norel et al.’s claim that “when the number of performance metrics is larger than two, most methods fail to be the best in all categories assessed” [16]: even a method judged better overall in the first analysis is acknowledged by its own authors to perform worse in some substudies.
Our study has some limitations. We neither performed a systematic literature search nor assessed the quality of the investigated papers, in particular the quality of the performance measures, as our study is meant to be illustrative. The evaluator extracting the data could obviously not be blinded to the type of paper (non-neutral or neutral). This lack of blinding could have slightly distorted our results, as the evaluator expected new methods to be optimistically rated in the papers introducing them. This expectation might have affected, for example, his (partly subjective) evaluation of blurred or ambiguous graphical results within the papers being evaluated. Moreover, our study did not take into account the sizes of the differences between method performances: a method was simply recorded as better or worse than the other, which obviously leads to a loss of information and precision.
The precise definition of a “method” also presents a problem. Many of the papers evaluate the methods within a full preprocessing “pipeline”, with optional steps such as background correction and the elimination of probes based on detection p-values; as a result, comparisons between two methods in different papers may be based on different pipelines (it should also be noted that the availability of such pipeline “parameters” presents another opportunity for preferential reporting). In a similar vein, we had to make subjective decisions on whether different implementations of similar algorithms constituted distinct methods or the same method. The evolution of a method over time was likewise not taken into account. Authors often release new (hopefully improved) versions of the packages implementing their methods, so when the same two methods are compared at different points in time, the evaluations may not be fully consistent, although this is unlikely to introduce systematic bias.
Truly “neutral” authorship could also not be verified: we do not know what personal preferences and connections the authors of studies labelled as neutral may have, and long author lists may overlap in ways we did not take into account. Most importantly, interpreting the complex, multidimensional comparisons of methods reported in the papers was very difficult. In particular, due to dependence patterns (within studies and between methods), standard statistical inference (e.g., deriving confidence intervals for the above-mentioned rates) was not possible. Despite these limitations, we feel this study convincingly illustrates the issue of over-optimism and indicates that its magnitude is not negligible.