Open Access

ggbio: an R package for extending the grammar of graphics for genomic data

Genome Biology 2012, 13:R77

DOI: 10.1186/gb-2012-13-8-r77

Received: 8 June 2012

Accepted: 31 August 2012

Published: 31 August 2012

Abstract

We introduce ggbio, a new methodology to visualize and explore genomics annotations and high-throughput data. The plots provide detailed views of genomic regions, summary views of sequence alignments and splicing patterns, and genome-wide overviews with karyogram, circular and grand linear layouts. The methods leverage the statistical functionality available in R, the grammar of graphics and the data handling capabilities of the Bioconductor project. The plots are specified within a modular framework that enables users to construct plots in a systematic way, and are generated directly from Bioconductor data structures. The ggbio R package is available at http://www.bioconductor.org/packages/2.11/bioc/html/ggbio.html.

Rationale

Visualization is an important component of genomic analysis, primarily because it facilitates exploration and discovery, by revealing patterns of variation and relationships between experimental data sets and annotations. Data on the genome fall into two classes: annotations, such as gene models, and experimental measurements, such as alignments of high-throughput sequencing data. The unique and unifying trait of all genomic data is that they occupy ranges on the genome. Associated with the ranges is usually multivariate meta-information both at the feature level, such as a score or functional annotation, and at the sample level, such as gender, treatment, cancer or cell type. These data can range in scale from hundreds to billions of data points, and the features are dispersed along genomes that might be many gigabases in length. Visualization tools need to slice, dice and summarize the data in different ways to expose its different aspects and to focus on different resolutions, from a sensible overview of the whole genome to detailed information on a per-base scale. To help focus attention on interesting features, statistical summaries need to be viewed in conjunction with displays of raw data and annotations.

Various visualization tools have been developed, most of which are implemented in the form of a genome browser. Data are typically plotted along with annotations, with genomic coordinates on the horizontal axis and other information laid out in different panels called tracks. Examples of genome browsers include the desktop-based browsers Integrated Genome Browser [1, 2] and Integrative Genomics Viewer [3, 4]. There are also web-based genome browsers, including Ensembl [5], UCSC Genome Browser [6], and GBrowse [7], and several new web-based browsers, like Dalliance, which rely on technologies like HTML5 and Scalable Vector Graphics [8], or Adobe Flash, like DNAnexus [9]. Other software, like Circos [10], provides specialist functionality. R also has some new tools for visualizing genomic data, GenomeGraphs [11] and Gviz [12]. They all have advantages for different purposes: some are fast, while others have easier user interfaces. Some are interactive, offer cross-platform support or support more file formats.

Data graphics benefit from being embedded in a statistical analysis environment, which allows the integration of visualizations with analysis workflows. This integration is made cohesive through the sharing of common data models [13]. In addition, recent work on a grammar of data graphics [14, 15] could be extended for biological data. The grammar of graphics is based on modular components that when combined in different ways will produce different graphics. This enables the user to construct a combinatoric number of plots, including those that were not preconceived by the implementation of the grammar. Most existing tools lack these capabilities.

A new R package, ggbio, has been developed and is available on Bioconductor [16]. The package provides the tools to create both typical and non-typical biological plots for genomic data, generated from core Bioconductor data structures by either the high-level autoplot function, or the combination of low-level components of the grammar of graphics. Sharing data structures with the rest of Bioconductor enables direct integration with Bioconductor workflows.

Basic usage

In ggbio, most of the functionality is available through a single command, autoplot, which recognizes the data structure and makes a best guess of the appropriate plot. Additional file 1 Table S1 lists the type of plot produced for each type of data structure. The plot in Figure 1 was rendered with the autoplot function in ggbio, using the following code:
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig1_HTML.jpg
Figure 1

Gene structure. Example of exons of gene SSX4 and SSX4B isoforms, annotated to illustrate the grammar of graphics extensions used. The filled rectangles represent exons and the chevrons represent introns. They are grouped by transcript ID and the y axis shows the stepping levels, which stack transcripts to avoid overplotting. Color is mapped from strand direction.

autoplot(grl, aes(color = strand))

The data, grl, is a GRangesList object, a Bioconductor data structure for representing compound ranges, including a set of transcript structures. The autoplot function recognizes the GRangesList object and draws the intervals in the typical fashion for a gene model or alignment. The y axis is generated by a layout algorithm to ensure that the transcripts are not overplotted. The x axis is automatically set as the genomic coordinates. The call to aes maps the strand variable to the color aesthetic. Users can specify labels, titles, layout, and so on by passing additional arguments to autoplot. Compared to the more general qplot API of ggplot2, autoplot facilitates the creation of specialized biological graphics and reacts to the specific class of object passed to it. Each type of object has a specific set of relevant graphical parameters, and further customization is possible through the low-level API, introduced later.
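The way a single entry point can react to the class of its argument can be sketched with R's S4 dispatch. The following is only an illustration under stated assumptions: the classes ToyRanges and ToyRangesList and the generic describe_plot are hypothetical stand-ins, not part of ggbio or Bioconductor, and the methods return a description instead of drawing anything.

```r
library(methods)

# Toy stand-ins for Bioconductor classes (hypothetical, for illustration only).
setClass("ToyRanges", representation(start = "numeric", end = "numeric"))
setClass("ToyRangesList", representation(groups = "list"))

# A generic whose behaviour depends on the class of `object`.
setGeneric("describe_plot", function(object, ...) standardGeneric("describe_plot"))

# Plain ranges: one rectangle per interval.
setMethod("describe_plot", "ToyRanges", function(object, ...) {
  "rectangles for each interval"
})

# Grouped ranges: intervals of one group are drawn connected.
setMethod("describe_plot", "ToyRangesList", function(object, ...) {
  "rectangles joined by chevrons, one row per group"
})

r  <- new("ToyRanges", start = c(1, 10), end = c(5, 20))
rl <- new("ToyRangesList", groups = list(tx1 = r, tx2 = r))

describe_plot(r)
describe_plot(rl)
```

The caller uses the same function name in both cases; the choice of plot type is made by the method registered for the object's class.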

Plotting tracks

In many genome visualizations, different datasets are typically plotted separately and then stacked on top of the same x axis, the genomic coordinates. These plots are often called tracks, because they are usually much wider than they are tall and run parallel to each other. Each track might contain a heatmap, or a histogram, or a density plot, for example. The data displayed in each track is related to the data displayed in the other tracks through the shared genomic axis. The goal is to be able to observe different patterns in these snapshots for regions of the genome. When displaying a relatively large region, the tracks might show a summary of the data, whereas more details are depicted for smaller regions.

The ggbio package provides a function called tracks, which stacks plots in a specified order and creates a Tracks object. The object allows users to zoom and shift the plot, as well as modify parameters like the height and theme of individual tracks. In the following example, p1, p2 and p3 are plot objects in the session:
tracks(p1, p2, p3, heights = c(4, 1, 1))
Figure 2 illustrates the creation of tracks. One tumor/normal pair of RNA-seq samples is shown with the goal of interpreting splicing changes. A key aspect of the plot is how the read alignment coverage and junction counts are composited and then juxtaposed with the data from the other sample, as well as the annotated transcript models. The viewer is then able to relate the changes in coverage to the corresponding transcript structures, via the common x axis. At the top of the figure is an ideogram overview of the chromosome, using colors corresponding to the Giemsa stain.
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig2_HTML.jpg
Figure 2

Splicing summary with coordinate truncate gaps. An example of a plot made from multiple tracks. At the top, the relevant chromosome is drawn with the subregion of interest marked in red. The middle track shows the splicing summary plots for the gene ALDOA for normal and tumor samples. Splicing is shown as arches; size is used to represent junction counts and color represents novelty: blue indicates known splicing events against the model and red indicates novel splicing events. The height of arches is proportional to the distance between the two ends of the arches, or the distance between the junction reads. Coverage is shown by position to address supporting evidence in the raw data. The splicing summary plots are aligned with a view of the gene structure in the bottom track. The thicker rectangle represents the Consensus Coding DNA Sequence (CCDS) transcripts. The plots in both tracks are made with the truncate gaps coordinate transformation. The space dedicated to introns has been significantly reduced, and the exonic regions are shown in detail, even though the entire gene region is in view.

Genomic overviews

The purpose of an overview plot is to give a grand view of the entire genome. By definition, this means that the resolution will be poor and that only large features will be visible. An overview may reveal large features that might be missed if one focused too narrowly. Different methods for mapping the genomic axis to the screen have been applied to address the space issues, and also to ease the drawing of connections between regions.

Grand linear view

Figure 3 shows an example of a grand linear view, which lays out the entire genome along a single linear axis. The plot shown in the figure is a special case of the grand linear plot called a Manhattan plot, due to its resemblance to the Manhattan skyline. It emphasizes extreme events, which show up as high-valued outliers. This view has been used for many genome-wide association study reports [17]. The data here come from a genome-wide association study on Angus cattle, and the data are faceted by three different scoring and classification methods [18]. The horizontal axis shows the global genomic coordinates, and the vertical axis is generally mapped to some quantity of interest, in this case, the genetic variance.
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig3_HTML.jpg
Figure 3

Manhattan plot. Grand linear view applied to a Manhattan plot as part of a genome-wide association study in Angus cattle. The y axis shows genetic variance, calculated by sliding windows of five consecutive SNPs for the infectious bovine keratoconjunctivitis (IBK; a type of pinkeye) score. The x axis is the genomic coordinates with all the chromosomes side-by-side. The horizontal striping of color helps to indicate the end of one chromosome and the beginning of another. The plot is faceted by three different analysis methods. There is one extreme variance value in the middle facet, in the region of chromosome 23. There are also a few large values in other regions. According to the results from the paper, three of these regions, 2, 13 and 23, are found to be potentially indicative of a quantitative trait locus associated with IBK.

The plot uses linear layout and employs the genome coordinate transformation, which transforms the chromosomal coordinates into global genomic coordinates as if all of the chromosomes were concatenated together. The transformation supports both proportional and uniform scaling of chromosomes, so that the plot area consumed by a chromosome is either proportional to its length or the same as the other chromosomes. It is also possible to add a buffer or break between chromosomes.
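The genome coordinate transformation described above can be sketched in base R. This is a minimal illustration, not ggbio's implementation: the function to_genome_coord, its arguments and the chromosome lengths are hypothetical, chosen only to show proportional scaling, uniform scaling and an optional buffer between chromosomes.

```r
# Map (chromosome, position) pairs onto a single global axis.
# `seqlengths` is a named numeric vector of chromosome lengths (hypothetical values below).
to_genome_coord <- function(chrom, pos, seqlengths,
                            scaling = c("proportional", "uniform"),
                            buffer = 0) {
  scaling <- match.arg(scaling)
  # Width given to each chromosome on the global axis.
  widths <- if (scaling == "proportional") {
    seqlengths
  } else {
    setNames(rep(max(seqlengths), length(seqlengths)), names(seqlengths))
  }
  # Offset of each chromosome: widths of all preceding chromosomes plus buffers.
  offsets <- c(0, cumsum(widths + buffer))[seq_along(widths)]
  names(offsets) <- names(seqlengths)
  # Rescale the position within its chromosome, then shift by the offset.
  unname(offsets[chrom] + pos / seqlengths[chrom] * widths[chrom])
}

seqlengths <- c(chr1 = 1000, chr2 = 500, chr3 = 250)
chrom <- c("chr1", "chr2", "chr3")
pos <- c(100, 100, 100)

to_genome_coord(chrom, pos, seqlengths)                      # 100 1100 1600
to_genome_coord(chrom, pos, seqlengths, scaling = "uniform") # 100 1200 2400
to_genome_coord(chrom, pos, seqlengths, buffer = 50)         # 100 1150 1700
```

With uniform scaling, position 100 on the short chr3 lands further from the start of its block than position 100 on chr1, because every chromosome is stretched to the same width.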

Karyogram overview

Figure 4 shows a (single copy) karyogram overview plot, with the color indicating RNA-editing locations in human [19]. The karyogram layout represents chromosomes as rectangles and stacks them vertically or in a grid layout. Genomic position is relative to the chromosome, starting from the first position at the left for each rectangle. Associated information is overlaid on or over the box. Applications like Genome Graphs in the UCSC genome browser [6] provide a similar plot. The advantage of this layout over the grand linear view is more efficient use of horizontal space, and hence finer resolution detail on the positions of the features. The trade-off is that the second variable has less space: instead of a full vertical axis, the information needs to be fit into each rectangle. Thus, we obtain genomic position resolution at the cost of less data layer resolution. It is common to see this type of plot used for SNP density, varying levels of identity-by-descent (IBD) [20] and length of linkage disequilibrium spans [21].
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig4_HTML.jpg
Figure 4

Stacked karyogram overview. The karyogram plot shows a subset of human RNA-editing sites, color coded by region as follows: red indicates exons, green indicates introns and blue indicates that the exon/intron status is unknown.

Circular overview

The primary purpose of the circular view is to show links between genomic regions. This is generally infeasible with the linear or karyogram layouts. In a circular layout, features are organized into concentric rings. Figure 5 illustrates the circular overview on the data from a gene fusion study conducted by Bass and colleagues [22], who sequenced the genomes of nine individuals with colorectal cancer and identified an average of 75 somatic rearrangements per tumor sample. This circular view shows only a single sample (colorectal tumor sample CRC-1); the structural rearrangements are shown as links with intrachromosomal events in green and interchromosomal translocations in orange. An ideogram of the autosomes is shown in the outer ring, with somatic mutation and score tracks in the plot. There are many software packages that provide a circular overview plot, including Circos [10], CGView [23] and DNAplotter [24].
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig5_HTML.jpg
Figure 5

Single sample circular view. DNA structural rearrangements and somatic mutation in a single colorectal tumor sample (CRC-1). The outer ring shows the ideogram of the human autosomes, labeled with chromosome numbers and scales. The segments represent the missense somatic mutations. The point tracks show score and support for rearrangement. The size of the points indicates the number of supporting read pairs in the tumor and the y value indicates the score for each rearrangement. The links represent the rearrangements, where intrachromosomal events are colored green and interchromosomal events are colored orange.

Specialized plots

There are some typical types of plots used to examine specific biological questions. This section describes how ggbio builds two of these: a mismatch summary and an edge-linked interval plot.

Mismatch summary

Mismatch summary is one typical way to visualize alignments from sequencing data, especially in the context of variant calling. Other genome browsers, such as Integrative Genomics Viewer [4], Savant [25] and Artemis [26] render similar plots from BAM and variant call format (VCF) files. Figure 6 shows two different summaries of mismatches from a set of RNA-seq read alignments. The top plot shows one DNA-seq sample from the first phase of the 1000 Genomes Project [27], represented as a stacked barchart. It provides a detailed view of the coverage, where the counts of bases that match the reference are indicated by gray bars, and the counts of non-reference bases are indicated by a different color that is specific to the base (A, C, G or T).
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig6_HTML.jpg
Figure 6

Mismatch summary. An example of a mismatch summary plot, with associated variant calls. The top track shows a barchart of reference counts in gray and mismatched counts colored by the nucleotide. The middle track shows SNPs as letters, color coded also by nucleotide. At one position, all of the reads show a mismatch, 'T', that differs from the 'A' in the reference genome (bottom letter plot).

Edge-linked interval to data views

Interval data, like genes, regulatory sites, read alignments, and so on, have different lengths. Differences in length can be distracting when looking at associated numerical information. Thus, length is sometimes best ignored, and the interval treated as an id or categorical variable. Figure 7 shows an example. The top plot shows a profile display of expression levels for two samples, GM12878 and K562, where the genomic position of the exons is treated as a categorical variable, forcing equal width in the plot. This allows us to see exons where the expression level is different, without being distracted by the relative interval size of the exons. We could also consider this display to be a parallel coordinate plot [28, 29].
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig7_HTML.jpg
Figure 7

Edge-linked interval to data view. Edge-linked interval to data view for the expression of the exons of gene PDIA6. The top track shows the expression level for each of the exons, and the color indicates the sample (GM12878 or K562). The second track shows the links between the evenly spaced expression track and the exons track, below. The package DEXSeq, which produces a similar graphic, computes differential expression and significance, and significance is indicated by coloring the connecting lines red. The track at the bottom shows the annotated transcripts.

Biological extensions to the grammar of graphics

The base grammar

The work introduced in this paper builds upon the grammar of graphics conceived by Wilkinson [14] and expanded upon by Wickham [15]. The grammar is composed of interchangeable components that are combined according to a flexible set of rules to produce plots from a wide range of types. Table 1 explains the components of the grammar (data, geom, stat, scales, coord, facet), as utilized in ggplot2, and indicates how these are used to create two graphics: Figures 8 and 1.
Table 1

Components of basic grammar of graphics

Component: Data
Explanation: Data to visualize, containing variables and values
Figure 8 usage: A gene expression table
Figure 1 usage: A GRanges object (core data structure in Bioconductor)

Component: Geom
Explanation: A geometric object draws the data as a graphical primitive. Types of primitives include points, lines, polygons or text. Some statistical or composite primitives, such as histogram, boxplot and point range, are considered to be geoms
Figure 8 usage: Points with color indicating significance of expression (red = significant, black = not)
Figure 1 usage: Alignments (new), Chevron (new)

Component: Stat
Explanation: A statistical transformation transforms, filters and/or summarizes a variable prior to plotting. For example, binning and counting is necessary to make a histogram. The default would be an identity transformation, which does not change the data. In ggplot2 an appropriate default transformation is chosen according to the geom, for example, the bin transform for the histogram geom. Thus, the user rarely needs to explicitly specify one
Figure 8 usage: Identity (computation of M value and A values is done outside of the grammar)
Figure 1 usage: Steppings (new)

Component: Scales
Explanation: A scale maps the variables (for example, expression, treatment, gene id) from data space to aesthetics (for example, position, color, area). Scales also control associated guides like axes and legends. Included in scales are numerical transformations such as log or square root of variables, so that an axis can be drawn on a log scale, for example. The default is a linear scale
Figure 8 usage: A, the log geometric average, mapped to the x axis, and M, the log ratio, mapped to the y axis
Figure 1 usage: Genomic position mapped to position along x axis, and levels mapped to y axis

Component: Coord
Explanation: A coordinate system controls how two position scales work together. The default is the Cartesian coordinate system, but others such as a polar coordinate system could be chosen
Figure 8 usage: Cartesian
Figure 1 usage: Cartesian

Component: Facet
Explanation: A faceting specification is used to produce small multiples [42] for subsets of the data. In other graphical systems it is known as latticing [43], trellising [44] or even conditioning
Figure 8 usage: None
Figure 1 usage: None

Component: Layout (new)
Explanation: A layout is a new grammatical component for controlling how multiple plots are arranged in a figure. It was motivated by the need to display multiple genomic annotation data sets simultaneously, and also supports genomic overviews
Figure 8 usage: Single
Figure 1 usage: Linear

Components of the basic grammar of graphics, and the extended grammar, and how they are used in Figures 8 and 1. Figure 9 illustrates how the grammar has been extended for biological data. Entries marked with 'new' are those developed as part of this work; the rest are inherited from ggplot2.

https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig8_HTML.jpg
Figure 8

MA-plot. MA-plot for differential expression analysis in four RNA-seq samples with two cell lines GM12878 and K562, annotated to illustrate the use of the grammar of graphics. Points are the geometric object, the x axis indicates the normalized mean and the y axis indicates the log2 fold change. An aesthetic mapping from group to color uses red to indicate the most significantly differentially expressed observations (genes). This plot uses Cartesian coordinates.

Genomic data and abstractions

Data are the first component of the grammar, and data may be collected in different ways. Wilkinson makes a distinction between empirical data, abstract data and metadata [14]. Empirical data are collected from observations of the real world, while abstract data are defined by a formal mathematical model. Metadata are data about data, which might be empirical, abstract or metadata themselves. We will use the term data source to refer to concrete data in specific databases and file formats. This is roughly analogous to Wilkinson's empirical data.

The ggbio package attempts to automatically load files of specific formats into common Bioconductor data structures, using routines provided by Bioconductor packages, according to Additional file 1 Table S2. In Wilkinson's terms, the loaded data are then abstract, in that they are no longer tied to a specific file format. Analogously, a data structure may be created by any number of algorithms in R; all that matters is that every algorithm returns a result of the same type. The type of data structure loaded from a file or returned by an algorithm depends on the intrinsic structure of the data. For example, BAM files are loaded into a GappedAlignments object, while FASTA and 2bit sequences result in a DNAStringSet. The ggbio package handles each type of data structure differently, according to Additional file 1 Table S1. In summary, this abstraction mechanism allows ggbio to handle multiple file formats, without discarding any intrinsic properties that are critical for effective plotting.

Extension overview

Genomic data have some specific features that are different from those of more conventional data types, and the basic grammar does not conveniently capture such aspects. The grammar of graphics is extended by ggbio in several ways, which are illustrated in Figure 9 and described in Additional file 1 Table S3. These extensions are specific to genomic data, that is, genomic sequences and features, like genes, located on those sequences.
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig9_HTML.jpg
Figure 9

Diagram of the ggbio framework for processing sequence data. It starts with a mapping from different file types to different objects or data structures in R, using Bioconductor tools, followed by general and extended grammar of graphics mapping of data elements to graphical components. The final stage arranges the graphics in a designed layout to show annotation tracks or multiple data sets. Orange boxes and dark brown arrows indicate the extensions provided by ggbio.

Figure 1 illustrates how the components of the grammar are combined to plot gene structures. A sample of the data is shown in Table 2. The data are passed to ggbio as a GRangesList object. The chevron geom mimics the typical splice junction diagram found in textbooks and it draws the introns in the example. The exons are drawn using the rectangle geom, and the high-level alignment geom ensures that the introns and exons from the same transcript are drawn connected, according to the tx_id column. The position on the chromosome is mapped to the horizontal axis and strand is mapped to color. The vertical axis is mapped to a variable generated by the stepping statistic, which avoids overplotting between the transcripts. We will explain these aspects in the following sections.
Table 2

Example of GRanges object

      seqnames  ranges                strand  tx_id  exon_id
1     chrX      [48242968, 48243005]  +       35775  132624
2     chrX      [48243475, 48243563]  +       35775  132625
3     chrX      [48244003, 48244117]  +       35775  132626
4     chrX      [48244794, 48244889]  +       35775  132627
5     chrX      [48246753, 48246802]  +       35775  132628
...   ...       ...                   ...     ...    ...
26    chrX      [48270193, 48270307]  -       35778  132637
27    chrX      [48269421, 48269516]  -       35778  132636
28    chrX      [48267508, 48267557]  -       35778  132635
29    chrX      [48262894, 48262998]  -       35778  132633
30    chrX      [48261524, 48262111]  -       35778  132632

Typical biological data coerced into a data frame: a GRanges table representing genes SSX4 and SSX4B. One row represents one exon, seqnames indicates the chromosome name, ranges indicates the interval of the exon, strand is the direction, and tx_id and exon_id are the internal ids used for cross-database mapping.
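The stepping statistic mentioned above assigns each transcript a vertical level so that overlapping transcripts do not share a row. A minimal base R sketch of one way to compute such levels is shown here, using a greedy sweep over intervals sorted by start. The function stepping_levels and the transcript extents are hypothetical; ggbio relies on Bioconductor tools for this step, and its layout algorithm may differ.

```r
# Assign each closed interval [start, end] the lowest level on which it does
# not overlap an interval already placed there (greedy sweep over sorted starts).
stepping_levels <- function(start, end) {
  ord <- order(start, end)
  level_end <- numeric(0)            # right-most end placed on each level so far
  levels <- integer(length(start))
  for (i in ord) {
    free <- which(level_end < start[i])
    if (length(free) > 0) {
      lv <- free[1]                  # reuse the lowest free level
    } else {
      lv <- length(level_end) + 1L   # open a new level
    }
    level_end[lv] <- end[i]
    levels[i] <- lv
  }
  levels
}

# Four transcript extents (hypothetical values).
tx_start <- c(100, 150, 400, 120)
tx_end   <- c(300, 500, 600, 130)
stepping_levels(tx_start, tx_end)   # 1 2 1 2
```

For a gene model, the intervals passed in would be the full extent of each transcript (from its first exon start to its last exon end), so that all exons of one transcript share a level.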

Biological geometries

A geom is responsible for translating data to a visual, geometric representation according to mappings between variables and aesthetic properties on the geom. In comparison to regular data elements that might be mapped to the ggplot2 geoms of points, lines and polygons, genomic data has the basic currency of a range. Ranges underlie exons, introns and other features, and the genomic coordinate system forms the reference frame for biological data. We have introduced or extended several geoms for representing ranges and gaps between ranges. They are listed in Additional file 1 Table S3. For example, the alignment geom delegates to two other geoms for drawing the ranges and gaps. These default to rectangles and chevrons, respectively. Having specialized geoms for commonly encountered entities, like genes, spares users the tedious coding of primitives, and makes user code simpler and more maintainable.

Biological statistical transformations

A statistical transformation (stat) transforms or summarizes the data in a particular way. These statistics may be mapped to visual aesthetics in the same manner as the original data. In this work we introduce several statistical transformations specifically for genomic data as shown in Additional file 1 Table S3. For example, given a large number of read alignments, computation of coverage is useful, as shown in Figure 10. These transformations were implemented with significant help from Bioconductor tools. New statistical transformations are readily incorporated.
https://static-content.springer.com/image/art%3A10.1186%2Fgb-2012-13-8-r77/MediaObjects/13059_2012_Article_3010_Fig10_HTML.jpg
Figure 10

Coverage transformation. The statistical transformations coverage and stepping are used to summarize short-read data. Top: a set of (simulated) short reads, displayed vertically using the stepping transformation and the default geom 'rectangle'. Bottom: coverage is shown on the vertical axis, using the geom 'area'. This example uses a GRanges object as the data model.
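To make the coverage transformation concrete, the following base R sketch counts how many reads overlap each position. The function coverage_depth and the read coordinates are hypothetical and serve only as an illustration; in practice Bioconductor supplies a coverage function for its range classes, which is the kind of tool the text refers to.

```r
# Per-base coverage over positions 1..width from closed intervals [start, end],
# computed with a difference array followed by a cumulative sum.
coverage_depth <- function(start, end, width) {
  delta <- numeric(width + 1)
  for (i in seq_along(start)) {
    delta[start[i]] <- delta[start[i]] + 1        # a read begins here
    delta[end[i] + 1] <- delta[end[i] + 1] - 1    # a read has ended before here
  }
  cumsum(delta)[seq_len(width)]
}

# Four simulated reads on a 10-base region (hypothetical values).
read_start <- c(1, 3, 4, 8)
read_end   <- c(5, 6, 9, 10)
coverage_depth(read_start, read_end, width = 10)  # 1 1 2 3 3 2 1 2 2 1
```

The resulting vector is what an area geom would draw on the vertical axis, with genomic position on the horizontal axis.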

Biological coordinate transformations

Coordinate systems locate points in space, and we use coordinate transformations to map from data coordinates to plot coordinates. The most common coordinate system in statistical graphics is Cartesian. The transformation of data to Cartesian coordinates involves mapping points onto a plane specified by two perpendicular axes (x and y). Why would two plots transform the coordinates differently for the same data? The first reason is to simplify, such as changing curvilinear graphics to linear, and the second reason is to reshape a graphic so that the most important information jumps out at the viewer or can be more accurately perceived [14].

Coordinate transformations are also important in genomic data visualization. For instance, features of interest are often small compared to the intervening gaps, especially in gene models. The exons are usually much smaller than the introns. If users are generally interested in viewing exons and associated annotations, we could simply cut or shrink the intervening introns to use the plot space efficiently. For example, Figure 2 is able to show the entire gene region, with virtually no loss in data resolution. In ggbio, we propose three sets of coordinate systems, shown in Additional file 1 Table S3, which are useful for genomic data.
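The idea of shrinking introns while leaving exons untouched can be sketched in base R as follows. This is an illustrative sketch under stated assumptions, not ggbio's truncate gaps implementation: the function truncate_gaps, its gap_width argument and the exon coordinates are hypothetical, and the intervals are assumed to be closed and non-overlapping.

```r
# Shrink every gap between consecutive features to at most `gap_width` bases,
# keeping feature widths unchanged. Output is sorted by start.
truncate_gaps <- function(start, end, gap_width = 10) {
  ord <- order(start)
  s <- start[ord]
  e <- end[ord]
  gaps <- c(0, s[-1] - e[-length(e)] - 1)   # original gap before each feature
  shrink <- pmax(gaps - gap_width, 0)       # amount removed from each gap
  shift <- cumsum(shrink)                   # total removed up to each feature
  data.frame(start = s - shift, end = e - shift)
}

# Three exons separated by long introns (hypothetical values).
exon_start <- c(1000, 5000, 20000)
exon_end   <- c(1200, 5300, 20100)
truncate_gaps(exon_start, exon_end, gap_width = 50)
#   start  end
# 1  1000 1200
# 2  1251 1551
# 3  1602 1702
```

Each exon keeps its original width, while the 3,799-base and 14,699-base introns are both reduced to 50 bases, so the whole gene fits in a fraction of the original span.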

Biological faceting

Almost all experimental outputs are associated with an experimental design and other meta-data, for example, cancer types, gender and age. Faceting allows users to subset the data by a combination of factors and then lay out multiple plots in a grid, to explore relationships between factors and other variables. The ggplot2 package supports various types of faceting by arbitrary factors. The ggbio package extends this notion to facet by a list of ranges of interest, for example, a list of gene regions. There is always an implicit faceting by sequence (chromosome), because when the x axis is the chromosomal coordinate, it is not sensible to plot data from different chromosomes on the same plot. As an aside, generating a set of tracks might resemble faceting, but it is easier to fit into the grammar framework if we think of it as a post-processing step.

Biological layout

We have also extended the grammar of graphics with an additional component called layout, upon which the mapping from genomic coordinates to plot coordinates depends. The default layout simply maps the genomic coordinates to the x axis and facets by chromosome. The currently supported layouts are: linear (genomic coordinates mapped to the x axis), karyogram (each chromosome displayed separately, in an array), and circular (like linear, except wrapped around in a circle). The high-level genomic overview plots take advantage of these layout mechanisms.
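The description of the circular layout as a linear layout wrapped around a circle can be illustrated with a small base R sketch. The function to_circle and its arguments are hypothetical and only show the geometric idea; they are not part of ggbio. The radius argument stands in for the choice of concentric ring.

```r
# Wrap a linear genomic position around a circle.
# `pos` is a global position in [0, total]; `radius` selects the ring.
to_circle <- function(pos, total, radius = 1) {
  theta <- 2 * pi * pos / total
  # Start at 12 o'clock and move clockwise.
  data.frame(x = radius * sin(theta), y = radius * cos(theta))
}

# Positions at 0%, 25%, 50% and 75% of a genome of length 100, on a ring of radius 10.
to_circle(c(0, 25, 50, 75), total = 100, radius = 10)
```

Up to floating-point rounding, the four positions land at the top, right, bottom and left of the ring. A link between two regions can then be drawn as a curve between their two points, which is what makes this layout suited to showing rearrangements.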

Low-level grammar-oriented API

For custom use cases, ggbio provides a low-level API that maps more directly to components of the grammar and thus expresses the plot more explicitly. Generally speaking, we strive to provide sensible, overridable defaults at the high-level entry points, such as autoplot, while still supporting customizability through the low-level API.

All lower level functions have a special prefix to indicate their role in the grammar, like layout, geom, stat, coord, and theme. The objects returned by the low-level API may be added together via the conventional + syntax. This facilitates the creation of new types of plots. A geom in ggplot2 may be extended to work with more biological data models; for example, geom_rect will automatically figure out the boundaries of rectangles when the data is a GRanges, as do geom_bar, geom_segment, and so on. As an example, the following code produces the same plot as the code shown above, using the low-level API instead of autoplot:
ggplot() + geom_arrowrect(unlist(grl)) + geom_chevron(gaps(unlist(grl)))

The reader will notice how the low-level code is more descriptive about the composition of the plot. In this example, it says we start with an empty plot as created by ggplot. We then use geom_arrowrect for exons and add a second layer for the gaps using geom_chevron.

The low-level API may be used in conjunction with autoplot, via the + syntax. This makes it possible to save a plot as an object in a session and modify it in different ways while experimenting. For example, the following code applies a new theme to the existing graphic object. The theme_null function removes the background, labels and legend.
p <- autoplot(grl, aes(color = strand))
p + theme_null()

Materials and methods

The ggbio package is an extension for R, a free cross-platform programming environment for statistical analysis and graphics with more than 3,000 contributed packages. The package depends upon Bioconductor libraries for handling and processing data, including the implementation of the statistics in our extension of the grammar. The Bioconductor project is a collaborative effort to develop software for computational biology and bioinformatics with high-quality packages and documentation [30]. The visualization methods in ggbio depend heavily on the package ggplot2 [15], which implements the grammar of graphics. The new geoms in ggbio are constructed from primitives defined in ggplot2. We use ggplot2 as the foundation for ggbio, due to its principled style, intelligent defaults and explicit orientation towards the grammar of graphics model. The color schemes in ggbio were derived from standard palettes available in R [31-33].

The RNA-seq data used in this paper are from ENCODE [34]. Two cell lines, GM12878 (blood, normal, female) and K562 (blood, cancer, female), were selected, and there are two replicates for each sample. The data were mapped against hg19 using Spliced Transcript Alignment and Reconstruction (STAR) [35]. The Bioconductor packages Rsamtools [36] and GenomicRanges [37] were used to import the BAM files and count reads overlapping exons. The package DEXSeq [38] was used to conduct the expression analysis and find the most differentially expressed exons. We used the rtracklayer package [39] to import BED format files and cast them into GRanges objects for ggbio. The DNA-seq BAM files and VCF files used in Figure 6 were downloaded from the 1000 Genomes Project [27].

All figures, code and data links are available from the documentation section of the ggbio website [40].

Discussion

We have demonstrated how ggbio supports both the convenient construction of typical genomic plots and the invention of new types of plots from low-level building blocks. Use cases of ggbio range from generating reproducible, exploratory plots in the course of an analysis to the prototyping of new ways of looking at these complex data. Lessons learned might be applied to the design of more complex, interactive systems. A new package, visnab, is being developed to make interactive graphics for genomic data [41].

One such lesson learned is the importance of color choices, which are inconsistent in many existing tools. Color is one of the primary visual cues in a data graphic and needs to be handled with some intelligence. For example, the ggbio package builds on well-specified color palettes used in ggplot2 and biovizBase, including one that is based on the biologically inspired Giemsa stain colors, as shown at the top of Figure 2.

Abbreviations

API: application programming interface

ENCODE: The Encyclopedia of DNA Elements

SNP: single nucleotide polymorphism

UCSC: University of California Santa Cruz

VCF: variant call format.

Declarations

Acknowledgements

We are grateful to James Koltes, Kadir Kizilkaya and Jim Reecy for sharing their Angus cattle infectious bovine keratoconjunctivitis data. Tengfei Yin's research has been partially funded by Genentech Research and Early Development, Inc. We are particularly grateful for the support of Robert Gentleman.

Authors’ Affiliations

(1) Department of Genetics, Development and Cell Biology, Iowa State University
(2) Department of Statistics, Iowa State University
(3) Department of Bioinformatics, Genentech

References

  1. Integrated Genome Browser. [http://bioviz.org/igb/]
  2. Nicol J, Helt G, Blanchard S, Raja A, Loraine A: The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics. 2009, 25: 2730. 10.1093/bioinformatics/btp472.
  3. Integrative Genomics Viewer. [http://www.broadinstitute.org/igv/]
  4. Robinson J, Thorvaldsdottir H, Winckler W, Guttman M, Lander E, Getz G, Mesirov J: Integrative genomics viewer. Nat Biotechnol. 2011, 29: 24-26. 10.1038/nbt.1754.
  5. Flicek P, Amode M, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longden I, McLaren W, Overduin B, Pritchard B, Singh Riat H, Rios D, Ritchie G, Ruer M, Schuster M, et al: Ensembl 2011. Nucleic Acids Res. 2011, 39: D800. 10.1093/nar/gkq1064.
  6. Karolchik D, Baertsch R, Diekhans M, Furey T, Hinrichs A, Lu Y, Roskin K, Schwartz M, Sugnet C, Thomas D, Weber R, Haussler D, Kent WJ: The UCSC genome browser database. Nucleic Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.
  7. Stein L, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich J, Harris T, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.
  8. Down T, Piipari M, Hubbard T: Dalliance: interactive genome viewing on the web. Bioinformatics. 2011, 27: 889. 10.1093/bioinformatics/btr020.
  9. DNAnexus. [https://dnanexus.com/]
  10. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones S, Marra M: Circos: an information aesthetic for comparative genomics. Genome Res. 2009, 19: 1639-1645. 10.1101/gr.092759.109.
  11. Durinck S, Bullard J, Spellman P, Dudoit S: GenomeGraphs: integrated genomic data visualization with R. BMC Bioinformatics. 2009, 10: 2. 10.1186/1471-2105-10-2.
  12. Hahne F, Durinck S, Ivanek R, Mueller A: Gviz: Plotting data and annotation information along genomic coordinates (R package version 0.99.8). [http://www.bioconductor.org/packages/2.12/bioc/html/Gviz.html]
  13. Ding L, Wendl M, Koboldt D, Mardis E: Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet. 2010, 19: R188. 10.1093/hmg/ddq391.
  14. Wilkinson L: The grammar of graphics. Wiley Interdisciplinary Rev Comput Stat. 2005, 2: 673-677.
  15. Wickham H: ggplot2: Elegant Graphics for Data Analysis. 2009, New York: Springer-Verlag Inc
  16. Bioconductor. [http://www.bioconductor.org/]
  17. Gibson G: Hints of hidden heritability in GWAS. Nat Genet. 2010, 42: 558-560. 10.1038/ng0710-558.
  18. Kizilkaya K, Tait R, Garrick D, Fernando R, Reecy J: Whole genome analysis of infectious bovine keratoconjunctivitis in Angus cattle using Bayesian threshold models. BMC Proc. 2011, 5 (Suppl 4): S22. 10.1186/1753-6561-5-S4-S22.
  19. Kiran A, Baranov P: DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics. 2010, 26: 1772-1776. 10.1093/bioinformatics/btq285.
  20. The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.
  21. The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
  22. Bass AJ, Lawrence MS, Brace LE, Ramos AH, Drier Y, Cibulskis K, Sougnez C, Voet D, Saksena G, Sivachenko A, Jing R, Parkin M, Pugh T, Verhaak RG, Stransky N, Boutin AT, Barretina J, Solit DB, Vakiani E, Shao W, Mishina Y, Warmuth M, Jimenez J, Chiang DY, Signoretti S, Kaelin WG, Spardy N, Hahn WC, Hoshida Y, Ogino S, et al: Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet. 2011, 43: 964-968. 10.1038/ng.936.
  23. Stothard P, Wishart D: Circular genome visualization and exploration using CGView. Bioinformatics. 2005, 21: 537-539. 10.1093/bioinformatics/bti054.
  24. Carver T, Thomson N, Bleasby A, Berriman M, Parkhill J: DNAPlotter: circular and linear interactive genome visualization. Bioinformatics. 2009, 25: 119. 10.1093/bioinformatics/btn578.
  25. Fiume M, Williams V, Brook A, Brudno M: Savant: genome browser for high-throughput sequencing data. Bioinformatics. 2010, 26: 1938-1944. 10.1093/bioinformatics/btq332.
  26. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream M, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16: 944-945. 10.1093/bioinformatics/16.10.944.
  27. The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.
  28. Inselberg A: The Plane with Parallel Coordinates. Visual Computer. 1985, 1: 69-91. 10.1007/BF01898350.
  29. Wegman E: Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc. 1990, 85: 664-675. 10.1080/01621459.1990.10474926.
  30. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80. 10.1186/gb-2004-5-10-r80.
  31. Lumley T: Color-coding and color blindness in statistical graphics. Stat Computing. 2006, 17: 4.
  32. Neuwirth E: RColorBrewer: ColorBrewer palettes (R package version 1.0-5). [http://CRAN.R-project.org/package=RColorBrewer]
  33. colorspace. [http://cran.r-project.org/web/packages/colorspace/index.html]
  34. ENCODE. [http://genome.ucsc.edu/ENCODE/]
  35. STAR. [http://gingeraslab.cshl.edu/STAR/]
  36. Morgan M, Pages H: Rsamtools: Binary alignment (BAM), variant call (BCF), or tabix file import (R package version 1.9.26). [http://bioconductor.org/packages/release/bioc/html/Rsamtools.html]
  37. Aboyoun P, Pages H, Lawrence M: GenomicRanges: Representation and manipulation of genomic intervals (R package version 1.7.36). [http://www.bioconductor.org/packages/2.11/bioc/html/GenomicRanges.html]
  38. Anders S, Reyes A, Huber W: Detecting differential usage of exons from RNA-seq data. Genome Res. 2012, 22: 2008-2017. 10.1101/gr.133744.111.
  39. Lawrence M, Carey V, Gentleman R: rtracklayer: R interface to genome browsers and their annotation tracks (R package version 1.15.7). [http://www.bioconductor.org/packages/2.11/bioc/html/rtracklayer.html]
  40. ggbio. [http://www.tengfei.name/ggbio]
  41. visnab. [https://github.com/tengfei/visnab]
  42. Tufte E: The Visual Display of Quantitative Information. 1983, Cheshire, CT: The Graphics Press
  43. Sarkar D: lattice: Lattice graphics (R package version 0.17-22 2009). [http://CRAN.R-project.org/package=lattice]
  44. Becker R, Cleveland WS, Shyu MJ: The visual design and control of trellis displays. J Comput Graphical Stat. 1996, 6: 123-155.

Copyright

© Yin et al.; licensee BioMed Central Ltd. 2012

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.