Increased demands on participants to construct reproducible models
In traditional challenge formats, participants download test data sets, run their methods, and upload the outputs of their models to the challenge organizers. While simple and convenient for participants, this format does not take advantage of the considerable strengths of M2D, which include the ability (i) to easily disseminate models to the public, (ii) to perform post hoc experiments and new analyses after the challenge closes, (iii) to evaluate performance in newly obtained data sets, and (iv) to develop and experiment with ensemble models. Naturally, there is a trade-off: hosting, and participating in, an M2D challenge involves additional complexity and overhead compared to a traditional data challenge. Although the increased upfront burden on participants may negatively impact participation, it is offset by the greater flexibility and rigor that M2D brings to challenges. Moreover, as familiarity with virtualization and workflow technologies continues to grow, and as the technology itself matures, we expect these burdens on participants to decrease substantially.
Importance of designing challenges in conjunction with data contributors
Every benchmarking challenge relies on input datasets, and obtaining unpublished validation data requires close collaboration with the researchers generating the data. There may be a number of concerns around access and security of those data. Among these is the desire of data contributors to have the first opportunity to publish key scientific results from their data, which can at times conflict with the need to keep datasets private to ensure an unbiased benchmarking challenge. Additionally, challenge validation data may be composed of multiple cohorts, each originating from a separate data contributor, as was the case in the Multiple Myeloma Challenge. In such cases, data contributors may view each other as competitors, and additional care must be taken to ensure that the validation data are protected. To ensure the trust of data contributors, we developed guidelines on which summary statistics and sample characteristics participants could return, and we audited these returns accordingly. To further protect validation data in both the Digital Mammography and Multiple Myeloma Challenges, we applied a strict size limit to output logs. To drive method development, participants need easy access to training data with clear information about the “truth.” In many cases, the most viable approach is to develop synthetic models to generate training data. For example, in the case of the SMC-RNA Challenge, several rounds were scored using synthetic FASTQ files that could be provided to participants with minimal concerns around data privacy.
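The following is a minimal, hypothetical sketch (not the actual challenge infrastructure) of how a size limit on returned logs might be enforced before anything is sent back to a participant; the byte limit, function name, and file handling are illustrative assumptions.

```python
from pathlib import Path

# Illustrative cap on how much log text may be returned to a participant;
# the limits used in the actual challenges may have differed.
MAX_LOG_BYTES = 10_000


def truncate_log_for_return(log_path: str, max_bytes: int = MAX_LOG_BYTES) -> str:
    """Return at most `max_bytes` of a participant-facing log.

    Truncating (rather than filtering) keeps the audit manageable:
    organizers only need to review a bounded amount of text for
    potential leaks of sensitive data.
    """
    data = Path(log_path).read_bytes()
    if len(data) <= max_bytes:
        return data.decode("utf-8", errors="replace")
    head = data[:max_bytes].decode("utf-8", errors="replace")
    return head + "\n[log truncated by challenge organizers]\n"
```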
Develop robust strategies for generating training data
The selection of training and debugging data is a complex issue, and each challenge has had to adopt a customized approach depending on data availability. For some challenges, there were no privacy issues, and training data, a subset of the full data set, could be shared directly with participants, as was done for the Proteomics Challenge. Other challenges have used simulated data to bypass these issues, as in the SMC-RNA Challenge. While simulated datasets may not completely recapitulate the underlying biology, they provide a baseline of known and expected qualities of the data and can assist in developing robust computational pipelines. For the DM Challenge, none of the primary challenge data could be disseminated to participants. To help with model training, participants could instead submit Docker containers that were permitted to train models using a subset of the imaging data. Limited feedback was returned to participants from method logging, but this required careful scrutiny by challenge organizers to ensure that no sensitive data were leaked through the returned log files. Many teams in the DM Challenge utilized public datasets to train seed models and then used the private challenge data for further optimization.
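As a toy illustration of the simulated-data strategy (not the simulation framework used in the SMC-RNA Challenge), the sketch below writes a small synthetic FASTQ file whose reads are fully determined by a random seed, so that the “truth” can be shared alongside the training data without privacy concerns; the read length, read count, and naming scheme are assumptions.

```python
import random

BASES = "ACGT"


def write_synthetic_fastq(path: str, n_reads: int = 100,
                          read_len: int = 75, seed: int = 0) -> None:
    """Write a toy FASTQ file of random reads with uniform base qualities.

    Because the reads are generated rather than measured, the ground truth
    (here, the seed and the read identities) can be distributed freely.
    """
    rng = random.Random(seed)
    with open(path, "w") as handle:
        for i in range(n_reads):
            seq = "".join(rng.choice(BASES) for _ in range(read_len))
            qual = "I" * read_len  # constant Phred quality of 40
            handle.write(f"@synthetic_read_{i}\n{seq}\n+\n{qual}\n")


write_synthetic_fastq("synthetic_training.fastq")
```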
Monitoring, rapid correction, and feedback to participants
A public-facing challenge is a complex interaction that involves providing documentation to users, accepting work products, and making sure that outputs are compatible and that novel methods from external parties will function correctly within a pre-set evaluation system. Each of these steps can involve novel software-development, algorithmic, or scientific work. Consequently, challenge organizers need to put procedures in place to mitigate common failures, including (1) carefully documenting the input data format and the requirements for the model output format, (2) providing a small, representative data set that participants can download and test with their code prior to submission, (3) providing a mechanism for rapid assessment and feedback on execution errors using a reduced-size dataset, and (4) performing upfront validation before initiating computationally expensive and long-running jobs. When running computational models in the cloud, we are asking participants to give up the close, interactive exploration of the data they would normally pursue when tinkering with novel algorithmic approaches and troubleshooting potential defects in their code. If an algorithm fails to execute, providing log files back to participants may assist in diagnosing and fixing errors. However, this has the potential to leak data or sensitive information and must be tightly controlled. Consequently, if log files must be returned to participants, we recommend using simulated or “open” data for testing and troubleshooting models.
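A minimal sketch of the upfront validation described in item (4), assuming a hypothetical tab-delimited output format with a fixed header; the required columns and checks are illustrative, not the actual challenge specification.

```python
import csv

# Hypothetical required columns for a submission's output file.
REQUIRED_COLUMNS = ["sample_id", "prediction", "confidence"]


def validate_submission(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passes.

    Running cheap checks like these before launching long cloud jobs lets
    participants fix formatting errors in minutes rather than days.
    """
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            problems.append(f"missing columns: {missing}")
            return problems
        for line_num, row in enumerate(reader, start=2):
            try:
                conf = float(row["confidence"])
            except ValueError:
                problems.append(f"line {line_num}: confidence is not numeric")
                continue
            if not 0.0 <= conf <= 1.0:
                problems.append(f"line {line_num}: confidence outside [0, 1]")
    return problems
```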
Estimating and managing computational resources
For many challenges, computational methods can have non-trivial run times and resource requirements (see Fig. 3). For example, in the SMC-RNA Challenge, methods can average 4 h per tumor. During the final computational runs, every submitted method needs to be run against every testing set. This can quickly lead to thousands of computational jobs costing several thousand dollars, all of which are now run at the expense of the challenge organizers. In a number of challenges, runtime caps had to be put in place to eliminate methods that took multiple days to complete. In the case of the SMC-Het Challenge, methods were limited to a budget of $7/tumor. A high-memory machine cost $0.60 per hour, which equated to ~12 h of compute time for memory-intensive algorithms. In some challenges, preemptible machines were used for evaluation because of their lower cost, but these types of VMs work best for short-running methods that can complete before the cloud provider preempts the system. Efforts such as the Digital Mammography Challenge, in which both model evaluation and training are performed in the cloud, require significantly more compute resources. In that case, we limited compute budgets to 2 weeks per team per round for model training, with four rounds in the challenge. The high-end GPU servers cost several dollars per hour to rent from cloud providers. Not knowing in advance how many participants would join, we faced the risk of running out of computational resources. From this perspective, it is far less risky to ask participants to provide their own computation, but, of course, this is only feasible when data contributors agree to let participants download the training data. In short, when organizing a challenge, care must be taken to commit to running the training phase only when it is truly necessary for business reasons, such as the sensitivity of the training data.
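The budgeting described above amounts to simple arithmetic; the sketch below shows how a per-tumor budget translates into a runtime cap and a rough total-cost estimate. The prices match the figures quoted above, while the method and tumor counts in the final line are purely hypothetical.

```python
def runtime_cap_hours(budget_per_tumor: float, machine_cost_per_hour: float) -> float:
    """Hours of compute that one tumor's budget buys on a given machine."""
    return budget_per_tumor / machine_cost_per_hour


def total_cost(n_methods: int, n_tumors: int, hours_per_tumor: float,
               machine_cost_per_hour: float) -> float:
    """Cost of running every submitted method against every tumor."""
    return n_methods * n_tumors * hours_per_tumor * machine_cost_per_hour


# SMC-Het figures quoted above: $7/tumor at $0.60/hour gives ~11.7 h, i.e. roughly 12 h.
print(runtime_cap_hours(7.00, 0.60))
# Hypothetical final-run estimate: 50 methods x 100 tumors x 4 h/tumor at $0.60/h = $12,000.
print(total_cost(50, 100, 4.0, 0.60))
```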
Increased flexibility to evolve and adapt a challenge over time
During the active phase of a challenge, and even in post hoc analyses, a great deal of additional thought and analysis goes into the evaluation data and the evaluation criteria. In some cases, adjustments need to be made to the dataset based on characteristics discovered during the challenge. Fixing these issues while the challenge is running is inevitable, but every disruption disincentivizes participants from continuing to work on the challenge and may limit the moral authority of the challenge to drive community evolution. In previous challenges, if there was an issue with the testing data, it was impossible to adjust it and send it back to participants for new analysis. With portable code, however, it becomes possible to modify the testing set, rerun the methods, and re-evaluate. The SMC-Het Challenge faced the problem that there were no well-accepted standards for scoring complex phylogenetic relationships in cancer. This created a need to develop new methods for model simulation and scoring [10], which greatly increases the risk of unexpected errors, edge cases, or performance degradation. Because participants submitted reproducible code, their methods could be re-evaluated using newly generated models and evaluation methods.
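One way to picture this flexibility is re-scoring archived submission outputs under a revised metric without re-contacting participants. The sketch below is a hypothetical stand-in, not the SMC-Het scoring harness: the file names, directory layout, and the simple accuracy metric are assumptions for illustration.

```python
import csv
from pathlib import Path


def load_calls(path: str) -> dict:
    """Load sample_id -> prediction pairs from a tab-delimited file."""
    with open(path, newline="") as handle:
        return {row["sample_id"]: row["prediction"]
                for row in csv.DictReader(handle, delimiter="\t")}


def score_v2(truth: dict, predictions: dict) -> float:
    """A hypothetical revised scoring metric; plain accuracy here."""
    correct = sum(1 for k, v in truth.items() if predictions.get(k) == v)
    return correct / len(truth)


# Re-score every archived output under the revised metric; the outputs (or
# the containers that regenerate them) are already on hand, so no new
# participant action is required.
truth = load_calls("truth_v2.tsv")
for output in Path("archived_outputs").glob("*.tsv"):
    print(output.name, score_v2(truth, load_calls(str(output))))
```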
Model distribution and re-use
Docker containers provide a highly modular format for distribution, and several repositories allow users to download a software image with a single command. However, this is only one component of distribution; there is also a need for systems that document how to invoke the tool, with descriptions of command-line formatting, tunable parameters, and expected outputs. If these descriptions are machine-parseable, they can be deployed with workflow engines that manage large collections of tasks. In the case of SMC-Het, the chain of commands was documented using the standards from the Galaxy Project [11]. For the SMC-RNA Challenge, these descriptions were made using the Common Workflow Language (CWL) [https://doi.org/10.6084/m9.figshare.3115156.v2]. These systems allow for automated deployment and are used as part of the evaluation framework deployed by challenge organizers. Because of this, two of the winning methods from the SMC-RNA fusion calling challenge have been integrated into the NCI’s Genomic Data Commons (GDC) [12] standard analysis pipeline and are now being applied to a number of datasets, including TARGET, CPTAC, MMRF, and TCGA.
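As a simplified illustration of why machine-parseable tool descriptions matter (this is neither CWL nor the Galaxy tool format, just a stripped-down stand-in), the sketch below turns a small dictionary describing a containerized tool into a runnable command line; the image name, command, and mount points are assumptions.

```python
import subprocess

# A stripped-down, hypothetical tool description; the actual challenges used
# Galaxy tool definitions or CWL documents with far richer metadata.
TOOL = {
    "image": "example/fusion-caller:1.0",
    "command": ["call-fusions", "--fastq", "/data/input.fastq",
                "--out", "/output/fusions.tsv"],
}


def build_docker_invocation(tool: dict, data_dir: str, out_dir: str) -> list:
    """Translate a machine-readable description into a docker command line.

    A workflow engine performs the same translation at scale, scheduling
    many such invocations across large collections of inputs.
    """
    return ["docker", "run", "--rm",
            "-v", f"{data_dir}:/data:ro",
            "-v", f"{out_dir}:/output",
            tool["image"], *tool["command"]]


cmd = build_docker_invocation(TOOL, "/challenge/data", "/challenge/output")
subprocess.run(cmd, check=False)
```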