Overview
PEGR is a project management platform designed to organize, track, and disseminate the workflow of (epi)genomic projects from the start of an experiment through DNA sequencing, bioinformatic analyses, and figure generation (Fig. 1). It tracks sample information and sequencing metadata, manages the bioinformatics workflow, and provides QC reporting and visualization. PEGR is intended to enable a more complete scientific workflow from hypothesis generation through publication-quality figures.
PEGR supports user submission of detailed sample and experiment information through several methods, including a web-interface, real-time QR reagent barcode tracking with an Android app, and an Excel-based sample submission form. After experimental metadata has been recorded in PEGR (e.g., cell line, species, assay), a sequencing run can then be relationally linked to this information. PEGR tracks Illumina sequencing runs in real-time by periodically probing the sequencer’s output data repository. When PEGR detects the completion of a sequencing run (i.e., RunCompletionStatus.xml), it will automatically initiate an external bioinformatics workflow platform. Presently PEGR natively supports the Galaxy platform and is extensible to other workflow engine platforms such as Pegasus and bash shell scripting [22]. PEGR collects the metadata information of bioinformatic outputs in real-time and displays them in an online workflow monitoring dashboard. Users are also able to query PEGR programmatically using a RESTful API for metadata related to specific samples or workflow runs. Critically, PEGR maintains and tracks the relational links between the figures and analyses generated by the bioinformatic software back through the starting reagents for any given sample.
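The run-completion check described above can be sketched as a single polling pass. This is a minimal illustration, not PEGR's actual implementation: the one-folder-per-run directory layout, the function names, and the `on_complete` callback are assumptions. Only the marker file name comes from the text.

```python
from pathlib import Path

# Marker file named in the text; the one-folder-per-run layout is an assumption.
COMPLETION_MARKER = "RunCompletionStatus.xml"


def find_completed_runs(repo_root, already_seen):
    """Return names of run folders that contain the marker and are not yet seen."""
    completed = []
    for run_dir in sorted(Path(repo_root).iterdir()):
        if not run_dir.is_dir() or run_dir.name in already_seen:
            continue
        if (run_dir / COMPLETION_MARKER).is_file():
            completed.append(run_dir.name)
    return completed


def poll_once(repo_root, already_seen, on_complete):
    """One polling pass: call on_complete for each newly finished run."""
    for run_name in find_completed_runs(repo_root, already_seen):
        on_complete(run_name)        # e.g., start the external workflow
        already_seen.add(run_name)   # never trigger the same run twice
```

Calling `poll_once` on a timer approximates the periodic probing. Keeping the set of already-seen runs ensures that a workflow is started only once per run.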
Inventory management
PEGR provides an inventory management section in support of a key aspect of experimental reproducibility, creating the ability to simply and easily track the exact chemicals, enzymes, reagents, antibodies, equipment, and sample material that form the basis of experimental assays. While this form of reagent tracking is accomplished by most laboratories at a basic level, in practice, the quality of record-keeping can be variable depending on laboratory organizational structure and personnel training [20]. PEGR was architected to include an integrated inventory management system that seamlessly tracks all aspects of an experiment’s metadata. To reduce incorrect and faulty information from being uploaded into PEGR, approved inventory entries (ItemType) can be pre-defined by administrative users (Fig. 2A). Metadata fields such as name, vendor, catalog number, and lot number can be added to each ItemType to help guide the initial deployment and match LIMS fields available for import.
PEGR was designed to provide maximum flexibility for an assay workflow. Custom fields can be defined for a specific item type in the admin console. Given the wide range of possible reagents and variables a lab may choose to track, PEGR provides a simple CSV upload form which allows an administrative user to upload a list of all inventory item types and the associated inventory metadata that a laboratory wishes to track (Fig. 2A). This provides compatibility with common laboratory management systems such as Agilent iLab and Quartzy, which can export their tracked inventory as CSV [25, 26]. To prevent disorganization resulting from tracking all possible inventory items in a laboratory, ItemTypes are grouped into an ItemTypeCategory (Fig. 2B). This structure allows for the intuitive organization of the full spectrum of a laboratory’s inventory. The item types in the ItemTypeCategory list are defined through the web interface, where ItemTypes can be dynamically assigned and re-assigned as the needs of the laboratory change.
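As an illustration of the CSV-based item type upload, the sketch below parses an exported inventory list and groups item types by category. The column names (`category`, `item_type`, and any further metadata columns) are hypothetical, and the actual PEGR upload format may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; the real upload template may use others.
REQUIRED_COLUMNS = {"category", "item_type"}


def load_item_types(csv_text):
    """Group item types by category, keeping the remaining columns as metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    categories = defaultdict(dict)
    for row in reader:
        category = (row.get("category") or "").strip()
        name = (row.get("item_type") or "").strip()
        if not category or not name:
            continue  # skip incomplete rows instead of storing faulty entries
        metadata = {
            key: value.strip()
            for key, value in row.items()
            if key is not None and key not in REQUIRED_COLUMNS and value
        }
        categories[category][name] = metadata
    return dict(categories)
```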
The “Inventory” tab on the main PEGR navigation bar is the primary interface for tracking all instances of ItemTypes in a laboratory. New instances of ItemTypes can be created and old instances can be marked “inactive” by regular users (e.g., technicians, graduate students) as reagent stockpiles are finished (Fig. 2C). To reduce the “activation energy” required for adopting and maintaining in-depth tracking of reagent catalog numbers, specific lot numbers, aliquot dates, etc., PEGR leverages an easy-to-use QR barcode scanning application (“Barcode Scanner” app by ZXing) on Android devices that updates the PEGR backend database system in real-time [33]. The barcode scanner can be activated directly from a webpage in PEGR, and the result is returned to PEGR via a callback URL. Materials received by the lab that already have an attached barcode can be scanned from the Android devices and the appropriate metadata is recorded along with a time stamp. Purchased reagents and client samples with no existing barcode can be assigned a new barcode generated by PEGR. The barcode is shown in both text and 2D QR image in PEGR. PEGR’s barcode system also integrates with existing label printers which allows for the 2D image to be printed in different sizes to accommodate the physical dimensions of the inventory.
Experimental protocol versioning and integration with inventory management
PEGR provides a protocol assembly and management module. While the traditional method of protocol management for most wet laboratories is a physical paper binder containing common buffer recipes and basic experimental procedures, there are cloud-based approaches for experimental protocol management such as OpenWetware (https://openwetware.org) and Protocol-Online (http://www.protocol-online.org/). In contrast to these approaches, PEGR’s protocol management system directly links its laboratory inventory metainformation with tracked and version-controlled experimental protocols (Fig. 3A).
Defining the exact ItemType input and output for each protocol is crucial for PEGR to properly track laboratory metadata across an experiment. When new experiments are initialized, the user is required to follow the predefined protocol and record the items used during an experimental setup. All the item types defined in the protocol must be linked to an item instance before the user can move on to the next step. A new protocol can be defined in two ways under PEGR’s “Protocol” tab. The first method is directly through the PEGR user interface. A simple webform links to PEGR’s ItemType database and allows for the creation of protocols ranging from simple buffer recipes to highly complex multi-stage assays such as ChIP-seq using a controlled reagent vocabulary (Fig. 3B). The second method uses a CSV file upload similar to the one used by the ItemType tab (Fig. 3C). This allows for bulk upload of multiple protocols, a convenient feature for adding a large number of novel assays.
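The rule that every item type in a protocol must be linked to an item instance before the user can proceed can be expressed as a small check. The function names and the barcode-keyed mapping below are illustrative assumptions, not PEGR's internal API.

```python
def unlinked_item_types(required_item_types, linked_items):
    """List the protocol's item types that have no item instance attached.

    required_item_types: ItemType names defined by the protocol.
    linked_items: mapping of ItemType name -> barcode of the scanned instance
                  (hypothetical representation).
    """
    return [name for name in required_item_types if not linked_items.get(name)]


def can_advance(required_item_types, linked_items):
    """The next step is allowed only when nothing is left unlinked."""
    return not unlinked_item_types(required_item_types, linked_items)
```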
The minimum requirements for creating a new protocol include a protocol name, version number, and a protocol description. Users are encouraged to also upload a protocol file in PDF format that is stored by PEGR for users to download and print. When creating a new protocol, a user selects starting and ending materials for the protocol. These fields are restricted to the list of ItemTypes defined in PEGR, and autocomplete assists with selection. In the case of a simple buffer protocol, the individual components of the protocol are the input ItemTypes (e.g., 5 M NaCl, 500 mM EDTA) and the end product is the final buffer (e.g., NaCl 250 Wash Buffer). In the case of a protocol such as PCR, the “Traced Sample” field is also used to track a sample’s state entering and exiting a protocol stage. The concept of a “Traced Sample” allows PEGR to link a final sample across the multiple protocols in which it participates (Fig. 3D).
Similar to how ItemTypeCategory is used to organize the wide variety of ItemTypes in the “Inventory,” Protocol Groups are used to consolidate and organize the variety of protocols that often compose an experiment (Fig. 3E). Protocol Groups contain any number of protocols (e.g., ChIP reaction and DNA end repair) in an ordered set. This both enforces experimental organization and provides flexibility for a user to generate a new Protocol Group (e.g., ChIP-seq v2) by re-using previously defined protocols in combination with new protocols. Protocol Groups are initialized through the Admin console. This design consideration requires a Protocol Group to be thoroughly vetted by a PEGR administrator (e.g., principal investigator, lab manager) before it can be accessed and used by the entire group. While users are still able to construct and initialize any individual Protocol they desire, this produces an intentional pause-point in developing novel assays which requires users (e.g., graduate students) to reflect on their experimental design and discuss it with a relevant senior scientist.
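A minimal sketch of how ordered Protocol Groups and protocol re-use might be modeled is shown below. The class and field names are assumptions chosen for illustration. The `approved` flag stands in for the administrator vetting step described above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Protocol:
    name: str
    version: str
    description: str
    input_item_types: tuple = ()
    output_item_types: tuple = ()


@dataclass
class ProtocolGroup:
    name: str
    protocols: list = field(default_factory=list)  # order is significant
    approved: bool = False  # set by an administrator after vetting

    def derive(self, new_name, replacements):
        """Build a new, unapproved group that re-uses this group's protocols,
        swapping in replacements keyed by protocol name."""
        swapped = [replacements.get(p.name, p) for p in self.protocols]
        return ProtocolGroup(new_name, swapped, approved=False)
```

For example, a hypothetical "ChIP-seq v2" group can be derived from "ChIP-seq v1" by replacing only the end-repair protocol. The derived group starts unapproved, so it still has to pass administrator review.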
Tracking experimental metadata as it is generated
PEGR provides a section to record experiment metadata under the “Experiment” tab. In designing a section devoted to capturing experiment metadata, we considered that an experiment involves (a) inventory, (b) a protocol, (c) an input sample along with controls, and (d) a resulting product. The PEGR “Experiment” interface is designed to track and maintain the relational links between reagents (i.e., “Inventory”), protocols (i.e., “Protocol”), and the resulting end products (i.e., “Samples”). A new experiment can be initiated directly from the web interface (Fig. 4A). A new experiment can be assembled by combining any number of previously defined protocols into any desired organizational structure (Fig. 4B). Alternatively, a user can initialize a new experiment based on a Protocol Group (Fig. 4C). This provides an easy mechanism for quickly assembling common laboratory protocols and assays in a structured and well-defined manner.
Once an experiment is initialized in PEGR, the experimental metadata and status can be updated directly through the web interface. Starting a simple experiment (e.g., creating a wash buffer) allows users to add relevant inventory metadata to PEGR using a wizard-style guide that walks users through adding reagents and all their associated metadata that have previously been stored in PEGR inventory (Fig. 4D). While this can all be performed directly through the web interface, PEGR leverages a QR barcoding system, similar to the inventory system, to allow users to progress through experimental stages and collect their metadata in real-time using a hand-held Android device. This information includes but is not limited to Protocol ID, Reagent ID, Equipment ID, Tech ID, date, etc. Thus, each scanned item is linked to each experiment as associated metadata. Although we display the functionality of the webform for visualization purposes, the QR barcode scanner is the recommended method for linking experimental metadata in PEGR. In cases where the inventory item has never been previously instantiated within PEGR, a web-interface allows the user to define a new QR barcode and instance of the inventory item.
A typical lab process is to generate common laboratory reagent stocks (e.g., wash buffers) that are used multiple times across many different downstream experiments. However, more complicated experimental setups, like ChIP-seq, involve a “traced” sample which moves through multiple sequential experiments and combines with different reagents as it transitions through product states (e.g., sonicated chromatin converts to DNA library). A “traced” sample typically begins as a “BioSample” in PEGR. The “BioSample” is assigned a unique “Sample” ID within the PEGR database the moment it is added to an Experiment. This provides a clear delineation in the creation of new Samples in PEGR and helps to prevent users from initializing any number of theoretical Samples that are unlinked to any Experiment. This functionality mirrors the best practices of a standard laboratory notebook: lab notebooks are not designed to record proposed experiments, but only provide the record of a performed Experiment, so this logic is consistent with standard biochemical wet-bench practices. Importantly, in the case of traced samples, PEGR can display all the states that a sample has transitioned through, allowing for full experimental history tracking. A traced sample can be added to an experiment using either the web interface or the QR barcode system (Fig. 4E). PEGR also allows multiple samples to be attached to a single protocol. This enables the operator to process multiple samples in a batch while only needing to enter the related information once (e.g., when performing ChIP-seq on 8 samples in parallel).
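The idea of a traced sample accumulating states across protocols, and of one protocol entry applying to a whole batch, can be sketched as follows. The `TracedSample` class and `run_protocol_on_batch` helper are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TracedSample:
    sample_id: int
    # Ordered (protocol name, resulting state) pairs
    states: list = field(default_factory=list)

    def record(self, protocol_name, resulting_state):
        self.states.append((protocol_name, resulting_state))

    def history(self):
        """States the sample has passed through, oldest first."""
        return [state for _, state in self.states]


def run_protocol_on_batch(samples, protocol_name, resulting_state):
    """Enter the protocol information once and apply it to every sample."""
    for sample in samples:
        sample.record(protocol_name, resulting_state)
```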
Sequencing and automated bioinformatic workflows
PEGR provides a section to record metadata for DNA sequencing and bioinformatic data processing. Samples processed in parallel through PEGR’s “Experiment” module are natively grouped as “cohorts.” Cohorts are typically generated by a researcher when addressing specific questions within a scientific project. The biochemical end for these (epi)genomic cohorts is typically high-throughput sequencing (or other detection systems) and downstream data analysis. As the throughput of DNA sequencers continues to expand, the available sequencing bandwidth for any given sequencing run will often exceed the needs of a cohort of samples. As a result, multiple cohorts are often sequenced together in a single sequencing run. These multiplexed samples may originate from different scientific projects (Fig. 5A). Reciprocally, one or more cohorts comprise a Project, to which cohorts may be added over time. Therefore, we define a “sequencing cohort” as the group of samples that belong to the same project and a specific sequencing run.
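Under this definition, a sequencing cohort is identified by a (project, sequencing run) pair. The sketch below groups samples on that pair. The tuple layout of the input records is an assumption made for the example.

```python
from collections import defaultdict


def sequencing_cohorts(samples):
    """Group sample IDs by (project, sequencing run).

    samples: iterable of (sample_id, project, run_id) tuples
             (hypothetical record layout).
    """
    cohorts = defaultdict(list)
    for sample_id, project, run_id in samples:
        cohorts[(project, run_id)].append(sample_id)
    return dict(cohorts)
```

A single run that multiplexes samples from two projects therefore yields two sequencing cohorts, and a project sequenced over two runs also yields two.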
PEGR provides substantial integration with common Illumina platforms. It implements a real-time workflow tracking and quality control dashboard through integration with external bioinformatics systems, such as the Galaxy platform (Fig. 5B) [21]. Galaxy workflows designed to communicate with PEGR contain simple XML wrappers for Python scripts which send a JSON file to PEGR RESTful API in a standard HTTP POST request. The JSON file contains a variety of information tracked by PEGR, and one critical element is the History ID from Galaxy. This allows PEGR to connect reproducible bioinformatic Galaxy workflows with the biochemical records stored and managed by PEGR [34].
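The reporting step performed by the Galaxy-side Python scripts can be illustrated with the standard library alone. This is a sketch under stated assumptions: the endpoint URL and all JSON field names are hypothetical, and the real PEGR API schema may differ. The text confirms only that a JSON document containing the Galaxy History ID is sent by HTTP POST.

```python
import json
import urllib.request


def build_step_report(api_url, history_id, step_name, statistics):
    """Build (but do not send) the POST request reporting one analysis step."""
    payload = {
        "historyId": history_id,   # Galaxy History ID, the linking key (field name assumed)
        "step": step_name,         # hypothetical field name
        "statistics": statistics,  # hypothetical field name
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would transmit it. Separating construction from sending keeps the payload easy to inspect and test.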
As the output data from each analysis step returns to PEGR at the completion of each step, the status of the workflow is updated in real-time (Fig. 5C). The status of each analysis step being tracked is represented by a square. If the script completes successfully and passes the preliminary validation, the square will be colored in green. A script that results in one or more error messages has its square colored red. Clicking on the square renders the error messages in detail. API calls with “permission denied” have their square colored orange, and analysis steps with missing datasets have their square colored blue. For analysis steps that have not communicated back to PEGR, the square remains gray. If all squares become green, it indicates that the entire workflow has completed successfully. Note that bioinformatic workflows may vary for different sample types, and they may include different sets of analysis steps. To accommodate different workflows, PEGR defines a configuration for each workflow that lists all the analysis steps to be tracked. The workflow tracking panel is dynamically rendered according to this configuration.
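The color scheme and the configuration-driven rendering can be summarized in a few lines. The status keys and function names below are assumptions. The colors and their meanings are those given in the text.

```python
# Status keys are hypothetical; colors follow the scheme described above.
STATUS_COLORS = {
    "success": "green",
    "error": "red",
    "permission_denied": "orange",
    "missing_dataset": "blue",
}
NOT_REPORTED_COLOR = "gray"


def step_colors(configured_steps, reported_status):
    """One color per configured step, in configuration order."""
    colors = []
    for step in configured_steps:
        status = reported_status.get(step)
        if status is None:
            colors.append(NOT_REPORTED_COLOR)  # step has not reported back yet
        else:
            colors.append(STATUS_COLORS[status])
    return colors


def workflow_succeeded(configured_steps, reported_status):
    """True only when every configured step is green."""
    return all(c == "green" for c in step_colors(configured_steps, reported_status))
```

Because the list of steps comes from the per-workflow configuration, the same function serves workflows with different sets of analysis steps.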
In addition to tracking the workflow-specific metadata (e.g., peak-calling completes successfully, MEME failed), PEGR also tracks assay-independent quality control metrics such as total reads per sample, adapter dimers, mapped reads, uniquely mapped reads, and PCR-duplication level (Fig. 5C). This combination of statistics gives users an overview of the quality of the sequencing experiment. Through the web interface, the administrator may define the acceptable range for each field indicated at the header, and fields that have values outside the acceptable ranges are colored in red. After reviewing the statistics, administrators can indicate whether the sample has passed the quality control check and been “verified.” If the statistics indicate that errors may exist in the sequencing result (e.g., incorrect adapter index assignment), the authorized user can “delete” the sample directly on this page.
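The range check behind the red highlighting can be sketched as follows. The metric names and the use of `None` for an open-ended bound are illustrative assumptions.

```python
def out_of_range_fields(metrics, acceptable_ranges):
    """Return the names of metrics whose values fall outside their range.

    metrics: mapping of metric name -> numeric value.
    acceptable_ranges: mapping of metric name -> (low, high); either bound
                       may be None to leave that side open (assumed convention).
    """
    flagged = []
    for name, (low, high) in acceptable_ranges.items():
        value = metrics.get(name)
        if value is None:
            continue  # nothing reported yet, so nothing to flag
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged
```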
The workflow tracking and quality control dashboard can become quite wide as there is no upper limit to the number of scripts (i.e., columns) that may be tracked in PEGR. Users can hide columns by clicking the “−” sign on the header. The columns can be restored by clicking the “+” sign at the top. Multiple bioinformatics workflows can be applied to the samples in a single sequencing run or even a single sample. In this case, PEGR will display separate tabs for each workflow.
Reporting, visualization, and data dissemination
The Reporting module of PEGR is an interface to report, visualize, and disseminate the data it stores. PEGR provides a “Project” interface to organize sequencing cohorts and individual samples in a reporting and visualization dashboard (Fig. 6A). The project dashboard provides links to the dynamically generated reports of entire cohorts of samples or individual samples (Fig. 6B). This interface also provides a mechanism for granting project permissions to the various users of PEGR. This allows PEGR to provide its stored sample metainformation to external collaborators while limiting their ability to access all data within PEGR that may not apply to a specific shared project.
When selecting an individual sample in PEGR, users are presented with a custom report that contains all metadata affiliated with that sample (Fig. 6C). Understanding that compliance can be difficult to achieve in certain settings, PEGR does not require a complete metadata trail for data visualization and will only display the data that it has. In addition to providing the biochemical sample metadata, PEGR will also display the results of the bioinformatic Galaxy workflows live-streamed directly from Galaxy (Fig. 6D). As a key feature, PEGR does not duplicate raw or processed data files in Galaxy. PEGR only stores the relevant metadata needed to point to sequencing datasets and downstream analysis stored on Galaxy (i.e., Galaxy HistoryID). This design choice allows PEGR to track millions of sample details with a single-CPU server. The current version of PEGR used by the Cornell EpiGenomics Core tracks ~70 TB of Galaxy-generated analyses using less than 10 GB of hard disk space.