Dissemination of scientific software with Galaxy ToolShed

The proliferation of web-based integrative analysis frameworks has enabled users to perform complex analyses directly through the web. Unfortunately, it also revoked the freedom to easily select the most appropriate tools. To address this, we have developed Galaxy ToolShed.

several modules for manipulation of BAM files. The ToolShed configuration file for NVC contains the instructions necessary for installing these dependencies. This enables Galaxy, for instance, to download the NumPy package from its native URL and to compile it for the platform running any given Galaxy installation (such as one on AWS running the Ubuntu operating system). FreeBayes, on the other hand, is a very different kind of analysis utility (http://goo.gl/26yuLj). To install FreeBayes, its C++ source code needs to be downloaded from GitHub and built with the g++ compiler, noting that some of its components require the cmake utility. In addition, a particular version of SamTools [2] needs to be built alongside FreeBayes, and appropriate environment variables have to be configured for the newly installed tool to be accessible to a Galaxy instance. The ToolShed provides configuration syntax that makes this possible.
While the three tools provide an overview of the ToolShed's configuration syntax, they do not illustrate the full extent of possible complexities. One of the largest and most advanced sets of tools available in the ToolShed is represented by the ChemicalToolBoX (http://goo.gl/lAfxLv; Gruning et al. Submitted). The ChemicalToolBoX is a collection of 32 tools depending on over 20 external packages. Another intriguing feature of the ChemicalToolBoX is that it is not a genomic set of utilities, but a collection of computational chemistry tools. Other examples of complex tool suites available from the ToolShed include the metagenomic packages mothur [3] and QIIME [4], each containing close to a hundred individual tools and integrated by community contributors.

Reproducibility and tool versioning
Low overall reproducibility of published results represents a significant challenge for today's biomedical research, effectively blocking scientific progress [5,6]. In fact, reproducibility is an ensemble of related, but independent, issues ranging from providing access to primary data to recording exact details of every analytical procedure. One of the most challenging aspects of making biomedical analyses repeatable is managing versions of the tools used to interpret data. This is because software evolves continuously, and the latest version of any given tool may not produce the same results as an earlier one. Fig. S1 shows variation in allele frequency at a human mitochondrial site depending upon which version and parameter combination of a widely used short read mapper, bwa [7], has been used in the analysis. One can see that the earlier versions have been particularly problematic. In this regard, making every Galaxy analysis reproducible would mean keeping every version of every tool in all existing instances throughout the world, which is not practical. The ToolShed solves this challenge by providing a centralized tool versioning system. Because every Galaxy tool is versioned, repeating analyses becomes possible even when a particular instance is missing the correct tool: the user is warned that the current version of a tool is different and is given the option of installing the correct version from the ToolShed.
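The version check described above can be illustrated with a minimal Python sketch. It is not Galaxy or ToolShed code: the function name plan_rerun, the tool identifiers, and the version strings are hypothetical, and the sketch assumes that a recorded analysis is simply a list of (tool, version) pairs and that an instance knows which versions of each tool it has installed.

```python
def plan_rerun(recorded_steps, installed):
    """Split the steps of a recorded analysis into runnable and missing ones.

    recorded_steps: list of (tool_id, version) pairs from the original analysis
    installed: dict mapping tool_id to the set of versions present on this instance
    Both structures are illustrative assumptions, not Galaxy's internal model.
    """
    runnable, to_install = [], []
    for tool_id, version in recorded_steps:
        if version in installed.get(tool_id, set()):
            # The exact version used originally is available locally.
            runnable.append((tool_id, version))
        else:
            # The instance lacks this exact version; it would have to be
            # installed from the ToolShed before the step can be repeated.
            to_install.append((tool_id, version))
    return runnable, to_install


# Hypothetical tool names and versions, for illustration only.
steps = [("mapper", "1.0"), ("caller", "2.0")]
local = {"mapper": {"1.0", "1.1"}, "caller": {"2.1"}}
runnable, to_install = plan_rerun(steps, local)
for tool_id, version in to_install:
    print(f"Warning: {tool_id} {version} is not installed; install it from the ToolShed?")
```

The point of the sketch is that an exact version match, rather than "latest available", decides whether a step can be repeated as recorded.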

Ensuring quality of ToolShed submissions
Anyone can submit to the ToolShed, which now contains over 2,000 tools. The idea behind such openness is to lower the initial barriers and so speed adoption by the community. This approach has worked well for us before, yet it has one significant disadvantage. With submission being straightforward (http://goo.gl/1cKagk), there is no reward for quality, making it difficult for end users to differentiate between good tools and low-quality submissions. To deal with this situation, we have established the Intergalactic Utilities Commission (IUC), consisting of Galaxy tool developers from the US, Europe, and Australia, tasked with reviewing and flagging high-quality tools. To simplify the work of the IUC, we have developed a suite of ToolShed components designed to automatically evaluate submissions prior to formal IUC review. These components include a series of scripts that verify the existence of test data and execute the functional tests defined within the tool configuration.

Beyond Tools
As more and more local and cloud-based Galaxy instances are being used, there is a need for a central hub that would serve as a middle ground for storing and exchanging analytical components. We view the ToolShed as such a hub. In this report, we described its functionality in regard to handling analysis tools. However, it already extends beyond tools to include workflows. When one installs a workflow, the ToolShed automatically installs tools that are needed for the workflow but are missing from a given Galaxy instance.
In the future, we will extend the ToolShed to provide a centralized repository of tools, data, metadata, analysis workflows and practices as well as their published descriptions in the form of Galaxy Pages, which will be linked to relevant journal articles, providing an unprecedented level of research reproducibility and transparency.

Accessing data and tools
A new Galaxy instance on Amazon can be instantiated by pointing a web browser to http://usegalaxy.org, selecting "Cloud" on the upper pane of the interface, and clicking on the "New Cloud Cluster" link (the procedure is also detailed at http://usegalaxy.org/cloud). The following screencasts detail all ToolShed aspects described in this manuscript: In addition, we provide a BAM file containing reads from blood (SRR345592) and twenty RNA-seq timepoints (SRR353635-SRR353654) aligned against the hg19 version of the human genome (http://goo.gl/puWbOC). This file was prepared in the following way. First, reads from individual samples were aligned against the hg19 version of the human genome within Galaxy using bwa version 0.5.9-r16.

Naive Variant Caller
The Naive Variant Caller (http://goo.gl/QVQcp9) processes aligned sequencing reads in the BAM format and produces a VCF file containing per-position variant calls. This tool allows multiple BAM files to be provided as input and utilizes read group information to make calls for individual samples. User-configurable options allow filtering out reads that do not pass mapping or base quality thresholds and positions that do not reach a minimum per-base read depth; users can also specify the ploidy and whether to consider each strand separately. In addition to calling alternate alleles based upon simple ratios of nucleotides at a position, per-base nucleotide counts are also provided. A custom tag, NC, is used within the Genotype fields. The NC field is a comma-separated listing of nucleotide counts in the form of <nucleotide>=<count>, where a plus or minus character is prepended to indicate strand, if the strandedness option was specified.
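To make the NC format concrete, the following Python sketch parses such a string into per-strand nucleotide counts. It is an illustration based only on the description above, not the tool's own code: the function names parse_nc and total_by_nucleotide and the example values are assumptions, and tolerating an empty trailing item is a defensive choice rather than a documented property of the format.

```python
def parse_nc(nc_value):
    """Parse an NC string into a dict keyed by (strand, nucleotide).

    strand is "+" or "-" when a strand prefix is present, otherwise None.
    """
    counts = {}
    for item in nc_value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing comma (assumption)
        key, _, count = item.partition("=")
        if key[:1] in ("+", "-"):
            strand, nucleotide = key[0], key[1:]
        else:
            strand, nucleotide = None, key
        counts[(strand, nucleotide)] = counts.get((strand, nucleotide), 0) + int(count)
    return counts


def total_by_nucleotide(counts):
    """Collapse strand information into a total count per nucleotide."""
    totals = {}
    for (_strand, nucleotide), count in counts.items():
        totals[nucleotide] = totals.get(nucleotide, 0) + count
    return totals


# Illustrative values only; not taken from a real VCF.
stranded = parse_nc("+A=10,-A=8,+G=1")
print(stranded)
print(total_by_nucleotide(stranded))
```

Keeping the strand as part of the key preserves the information needed for downstream strand-specific filtering, while total_by_nucleotide gives the unstranded view when it is not needed.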

Variant Annotator
The Variant Annotator (http://goo.gl/SLJwkF) processes the raw variant count data from the Naive Variant Caller. Single nucleotide variant counts and allele statistics are reported for each site in a simple tabular format. Data from multiple samples are supported via sample columns in the input VCF. The first and second most abundant variants are reported, along with the frequency of the latter. The user can set a coverage threshold, which is applied to each strand individually. An allele count is computed, based on the number of alleles passing a user-supplied frequency threshold. A basic filter for strand bias is applied at this stage, excluding sites where the threshold-passing alleles differ between the strands. At these sites, neither strand's allele count is used, and the tool instead reports an allele count of zero.
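The per-site logic described above can be sketched in Python as follows. This is a reading of the prose, not the Variant Annotator's implementation: the function names, default thresholds, and output keys are hypothetical, and two details are assumptions rather than documented behaviour, namely that frequencies are computed per strand for the threshold test and that a site failing the coverage threshold also receives an allele count of zero.

```python
def passing_alleles(counts, min_freq):
    """Return the set of nucleotides whose frequency on one strand meets min_freq."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {nt for nt, c in counts.items() if c / total >= min_freq}


def annotate_site(plus, minus, min_coverage=10, min_freq=0.01):
    """Summarize one site from per-strand nucleotide counts (dicts nt -> count)."""
    # Combine both strands to rank variants by overall abundance.
    combined = {}
    for strand_counts in (plus, minus):
        for nt, c in strand_counts.items():
            combined[nt] = combined.get(nt, 0) + c
    total = sum(combined.values())
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    major = ranked[0][0] if ranked else None
    minor = ranked[1][0] if len(ranked) > 1 else None
    minor_freq = ranked[1][1] / total if minor is not None and total else 0.0

    # The coverage threshold is applied to each strand individually.
    covered = sum(plus.values()) >= min_coverage and sum(minus.values()) >= min_coverage

    # Strand-bias filter: the threshold-passing alleles must agree between strands.
    plus_pass = passing_alleles(plus, min_freq)
    minus_pass = passing_alleles(minus, min_freq)
    if covered and plus_pass == minus_pass:
        allele_count = len(plus_pass)
    else:
        allele_count = 0  # biased site (or, by assumption, an under-covered one)

    return {"major": major, "minor": minor,
            "minor_freq": minor_freq, "allele_count": allele_count}


# Illustrative counts only.
print(annotate_site({"A": 90, "G": 10}, {"A": 85, "G": 15}, min_freq=0.05))
print(annotate_site({"A": 90, "G": 10}, {"A": 100}, min_freq=0.05))
```

In the first call both strands support A and G, so the allele count is two; in the second, G passes the threshold on one strand only, so the site is treated as strand-biased and the allele count is zero.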