Using HTML tag library
Use of GIVE’s HTML tags does not require any downloading or installation. The simplest way to try out GIVE’s HTML tags is to use HUG, a graphical interface that will generate an HTML file for user chosen datasets.
Instead of using HUG, a web developer can import the entire GIVE library to a web page by inserting the following two lines (Lines 1, 2).
To display genomics data, the web developer can use either the <chart-controller> tag or the <chart-area> tag. The <chart-controller> tag will display genomic data as well as genome navigation features such as shifting and zooming (Additional file 1: Figure S5C). For example, adding the following line in addition to the two lines above would create a website similar to that in Additional file 1: Figure S5 (Line 3).
Here, the title-text attribute sets the title text of a website. The <chart-area> tag will display the track data without metadata controls such as data selection buttons and input box for genomic coordinates, while retaining some interactive capacities including dragging and zooming. This option provides the developer greater flexibility for website design. In addition, the <chart-area> tag is compatible with mobile apps.
Using GIVE-toolbox
GIVE-Toolbox is a set of command line tools for managing custom data (Additional file 1: Table S1). These command line tools automate data-related operations and relieve website developers from programming directly in a database language (MySQL). In addition to comprehensive documentation and tutorials (Additional file 1: Table S3), executing any tool with the -h argument prints its usage instructions. GIVE-Toolbox is our recommended option; however, developers can choose to work directly with MySQL instead.
Running GIVE-Docker as a standalone executable
Using Docker’s container technology (https://www.docker.com), we encapsulated GIVE’s code together with all environmental requirements and the database, including Apache, MySQL, and PHP, into a fully packaged executable called GIVE-Docker. This standardized executable can be deployed without system-specific configuration on all mainstream operating systems and cloud computing services, including Linux, macOS, Windows 10, AWS, and Azure. This standalone executable does not require system administration or installation of any prerequisite compiler or database and is therefore the recommended option. Use of the GIVE HTML tag library does not require running GIVE-Docker.
Experienced programmers can choose custom installation instead of using GIVE-Docker. A step-by-step guide of custom installation is provided in GIVE’s online manual.
Backstage technologies
The following technologies are wrapped inside the GIVE library. Website developers who use GIVE do not have to understand them or even know of their existence.
Query
A query is issued when the user views any genomic region (query region). A new query is issued when the user changes the genomic region. A query triggers two actions: data retrieval and data display.
Oak, a data structure
A data structure called Oak is developed to effectively load and transfer a subset of data in BED format. The subset is defined as a continuous genomic region within a chromosome. Oak is a type of tree data structure, with nodes defined below.
A node is composed of a list of key-value pairs and a set of attributes. A key is a pair of starting and ending genomic coordinates, termed left key and right key, respectively. When populated with data, a node keeps the data for a genomic region defined by the first left key and the last right key. The keys in a node partition the genomic region into non-overlapping sub-regions. A node can be either a branch node or a leaf node. The difference between a branch and a leaf lies in their values. A branch node is a node where the values are other nodes. A leaf node is a node where each value is a set of two lists of data points (Additional file 1: Figure S6). Each data point is a row of a BED file. When populated with data, the first list contains all the rows in the BED file where the start position matches the left key. The second list contains all the rows where the start and the end positions cover (span across) the left key. A value in a leaf node can also be empty. Leaf nodes with empty values are used to mark the genomic regions outside the query region.
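The two lists held by a leaf value can be illustrated with a minimal Python sketch. This is not GIVE's actual implementation (GIVE runs in JavaScript); the names `BedRow`, `LeafValue`, and `build_leaf_value` are hypothetical, a BED row is reduced to its first three columns, and BED's half-open coordinate convention is assumed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Simplified BED row: (chromosome, start, end), half-open [start, end).
BedRow = Tuple[str, int, int]


@dataclass
class LeafValue:
    # Rows whose start position equals the left key.
    start_here: List[BedRow] = field(default_factory=list)
    # Rows that start before the left key and extend past it.
    spanning: List[BedRow] = field(default_factory=list)


def build_leaf_value(left_key: int, rows: List[BedRow]) -> LeafValue:
    """Sort BED rows into the two lists kept under one left key."""
    value = LeafValue()
    for row in rows:
        _, start, end = row
        if start == left_key:
            value.start_here.append(row)
        elif start < left_key < end:
            value.spanning.append(row)
    return value
```

Keeping the spanning rows alongside each key means that a record which begins upstream of the viewing window can still be found from the first leaf inside the window, without walking backwards through the tree.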
Creating an Oak instance, populating data, and updating Oak
An Oak instance will be created, populated with data, or updated in response to a query. These actions accomplish data transfer from the server to a user’s computer. Only the data within the queried region will be transferred. Hereafter we will refer to an Oak instance as an Oak.
When the query region is on a new chromosome, an Oak will be created as follows. Every unique start position in the BED file that is contained within the query region is used to create a leaf node. The genomic regions on the queried chromosome but outside the query region are inserted as pairs of keys and empty values (placeholders) into the nodes with the nearest keys. The leaf nodes are ordered by their first left keys and sequentially linked by their pointers. A root node is created with all the leaf nodes as its children. This initial tree is fed into a self-balancing algorithm [28, 29] to construct a weight-balanced tree, thus finishing the construction of an Oak.
When the query region is on a previously queried chromosome, the query region will be compared with the Oak of that chromosome and the overlapping region will be identified. The data of the overlapping region are already loaded in the Oak and, to save time, are not loaded again. The data in the rest of the query region are then loaded into the Oak. This is done by creating a leaf node for every additional unique start position, removing the placeholder key-value pairs, and adding new placeholder key-value pairs for the rest of the chromosome. The weight balancing algorithm [28] is invoked again to re-balance this Oak. The weight balancing step prepares the Oak for efficient response to future queries.
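The comparison between a new query region and the already loaded region amounts to an interval subtraction. The following Python sketch shows one way to compute the parts that still need loading; the function name `unloaded_parts` is hypothetical, intervals are assumed to be half-open `(start, end)` pairs, and the loaded data are simplified to a single contiguous region.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def unloaded_parts(query: Interval, loaded: Interval) -> List[Interval]:
    """Return the pieces of `query` that are not covered by `loaded`."""
    q_start, q_end = query
    l_start, l_end = loaded
    parts: List[Interval] = []
    # Part of the query lying to the left of the loaded region.
    left_end = min(q_end, l_start)
    if q_start < left_end:
        parts.append((q_start, left_end))
    # Part of the query lying to the right of the loaded region.
    right_start = max(q_start, l_end)
    if right_start < q_end:
        parts.append((right_start, q_end))
    return parts
```

For instance, if positions 100 to 300 are already loaded and the user shifts the view to 200 to 400, only the interval from 300 to 400 needs to be requested from the server.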
Pine, a data structure
A data structure called Pine is developed to effectively load and transfer a subset of data in bigWig format. The subset is defined as a continuous genomic region within a chromosome. Pine can automatically determine the data granularity, which avoids transferring data at a higher than necessary resolution. The resolution of displayed data is limited by the number of pixels on the screen. Pine instances are always constructed to the appropriate depth and match the limit of the resolution.
A node consists of a list of key-value pairs and a set of attributes. The attributes are the same as those of Oak nodes, except there is an additional attribute, called data summary. The data summary includes the following metrics for a given node (the genomic region defined by the first left key and the last right key of the node): the number of bases; sum of values (summing over every base); sum of squares of the values; maximum value; and minimum value. A key is a pair of starting and ending genomic coordinates termed left key and right key, respectively. The keys in a node divide the genomic region into non-overlapping sub-regions. A node can be either a branch node or a leaf node. Their differences lie in the values. A branch node is a node where the values are other nodes (Additional file 1: Figure S7A). A leaf node is a node where each value is a list of data points (Additional file 1: Figure S7B). Each data point is a row of a bigWig file (binary format).
A node in Pine can have an empty key-value list and an empty data summary. If this is the case, we call it a placeholder node.
Creating a Pine instance, populating data, and updating Pine
A Pine is created when a query to a new chromosome is issued, using the following steps. First, the depth of the Pine tree is calculated as:
$$ \text{Tree depth} = \left\lceil \log_n\left(\text{chromosome length}\right) - \log_n\left(\text{resolution}\right) \right\rceil \tag{1} $$
The limit of the resolution (length of genomic region per pixel) is the total length of the queried genomic region (viewing area) divided by the number of horizontal pixels, namely the width of the SVG element in JavaScript.
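As a worked example of Eq. 1, the following Python sketch computes the tree depth from the chromosome length, the length of the viewing area, and the pixel width. The function name `tree_depth` and its parameter names are hypothetical, and the numbers used below are illustrative rather than taken from GIVE.

```python
import math


def tree_depth(chromosome_length: int, query_length: int,
               pixel_width: int, n: int = 20) -> int:
    """Evaluate Eq. 1 for a branching factor of n."""
    # Resolution limit: bases of genome shown per horizontal pixel.
    resolution = query_length / pixel_width
    return math.ceil(math.log(chromosome_length, n) - math.log(resolution, n))
```

With a 250 Mb chromosome, a 1 Mb viewing area, and a 1000-pixel-wide display, the resolution is 1000 bases per pixel and `tree_depth(250_000_000, 1_000_000, 1000)` evaluates to 5. Zooming out to the whole chromosome on the same display gives a depth of 3, so fewer levels (and less data) are needed.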
Next, a root node is created with keys covering the entire chromosome that contains the query region. Until the calculated depth is reached, for any node that overlaps with the query region, create a fixed number (n, n = 20 in the current release) of child nodes by equally partitioning its genomic region. If any of the created child nodes does not overlap with the query region, use a placeholder node instead. For each node, point the pointer to the “right hand” node at the same depth. Thus, a Pine is created. At this point, the Pine has not yet been loaded with actual data.
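The recursive partitioning described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not GIVE's code: `PineNode` and `build_pine` are hypothetical names, regions are half-open, the sibling ("right hand") pointers and the data summary are omitted, and a node is reduced to its region plus a list of children.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PineNode:
    start: int
    end: int
    children: List["PineNode"] = field(default_factory=list)
    placeholder: bool = False


def build_pine(start: int, end: int, query: Tuple[int, int],
               depth: int, n: int = 20) -> PineNode:
    """Build an empty Pine skeleton for the region [start, end)."""
    node = PineNode(start, end)
    if depth == 0:
        return node  # deepest level reached, no further partitioning
    q_start, q_end = query
    span = end - start
    for i in range(n):
        lo = start + span * i // n
        hi = start + span * (i + 1) // n
        if hi <= lo:
            continue  # region too short to split into n non-empty parts
        if lo < q_end and q_start < hi:
            # Child overlaps the query region: keep partitioning it.
            node.children.append(build_pine(lo, hi, query, depth - 1, n))
        else:
            # Child lies outside the query region: placeholder only.
            node.children.append(PineNode(lo, hi, placeholder=True))
    return node
```

Because only the children that overlap the query region are expanded, the number of real nodes grows with the depth and the width of the viewing area rather than with the size of the chromosome.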
To load data, every leaf node issues a request to retrieve the summary data of its covered region (between the first left key and the last right key). The request is answered by a PHP function wrapped within GIVE, which returns the summary data between the input coordinates from the bigWig file. After the summary data are filled in for all nodes at the deepest level, the parent nodes are filled in, with their summary data calculated from the summary data of their child nodes. This process continues until the root node is reached.
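The five metrics in the data summary can all be combined without revisiting the underlying bases, which is what allows parent nodes to be filled from their children. A minimal Python sketch follows; the class `DataSummary`, its field names, and the helpers `from_run` and `merge_summaries` are hypothetical and do not reflect GIVE's internal naming.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DataSummary:
    n_bases: int       # number of bases covered
    value_sum: float   # sum of values over every base
    square_sum: float  # sum of squared values over every base
    max_value: float
    min_value: float

    @classmethod
    def from_run(cls, length: int, value: float) -> "DataSummary":
        """Summary of `length` consecutive bases sharing one value."""
        return cls(length, value * length, value * value * length, value, value)

    def mean(self) -> float:
        return self.value_sum / self.n_bases


def merge_summaries(children: Iterable[DataSummary]) -> DataSummary:
    """Compute a parent summary from at least one child summary."""
    items = list(children)
    return DataSummary(
        n_bases=sum(s.n_bases for s in items),
        value_sum=sum(s.value_sum for s in items),
        square_sum=sum(s.square_sum for s in items),
        max_value=max(s.max_value for s in items),
        min_value=min(s.min_value for s in items),
    )
```

Storing the sum and the sum of squares, rather than a mean and a standard deviation, is what makes the merge a simple addition; the mean and variance of any node can still be derived from these totals when needed.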
A Pine will be updated when a new query partially overlaps with a previous query. In this case, the new depth (d2) is calculated using Eq. 1. This depth (d2) reflects the new data granularity. If d2 is greater than the previous depth, extend the Pine by adding placeholder nodes until d2 is reached. From the root to depth d2 - 1, if any placeholder node overlaps with the query region, partition it by creating n child nodes. If any of the newly created child nodes does not overlap with the query region, use a placeholder node instead. For any newly created node, point the pointer to the “right hand” node at the same depth. At this step, the Pine structure has been updated to the proper depth. Finally, at depth d2, retrieve summary data for every non-placeholder node that does not yet have summary data, and update the summary data of their parent nodes until the root is reached. In this way, only the new data within the query region that had not been transferred before will be transferred.
Memory management
We developed a memory management algorithm called “withering.” Every time a query is issued, this algorithm is invoked to discard obsolete data, that is, data that have not been used in the previous ten queries. “Withering” works as follows: every node is given an additional integer attribute called “life span.” When a node is created, its life span is set to 10. Every time a query is issued, all nodes overlapping with the query region, as well as all their ancestral nodes, have their life span reset to 10. The other nodes, which do not overlap with the query region, have their life span reduced by 1. All nodes whose life span reaches 0 are replaced by placeholder nodes.
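The life span bookkeeping can be sketched in Python as below. This is a simplified illustration rather than GIVE's implementation: the names `Node`, `LIFE_SPAN`, and `wither` are hypothetical, regions are half-open, and an expired node is marked as a placeholder in place (dropping its subtree) instead of being swapped for a new placeholder object. Because a child's region lies inside its parent's region, any ancestor of an overlapping node also overlaps the query, so ancestors are reset without special handling.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

LIFE_SPAN = 10  # a node survives this many consecutive queries unused


@dataclass
class Node:
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)
    life_span: int = LIFE_SPAN
    placeholder: bool = False


def wither(node: Node, query: Tuple[int, int]) -> None:
    """Update life spans for one query and drop expired subtrees."""
    if node.placeholder:
        return  # nothing left to discard
    q_start, q_end = query
    if node.start < q_end and q_start < node.end:
        node.life_span = LIFE_SPAN  # used by this query: reset
    else:
        node.life_span -= 1         # unused: one step closer to removal
    if node.life_span <= 0:
        node.children = []          # release the data held below this node
        node.placeholder = True
        return
    for child in node.children:
        wither(child, query)
```

Calling `wither(root, query)` once per query keeps the regions a user is actively browsing in memory, while regions left untouched for ten queries in a row are released.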