Myrna computational design
Myrna is designed to run in one of three modes: 'Cloud mode' using Amazon Elastic MapReduce; 'Hadoop mode' using a Hadoop cluster; or 'Singleton mode' using a single computer. Cloud mode requires that the user have appropriate accounts and credentials set up beforehand. Cloud mode does not require any special software installation; the appropriate software is either pre-installed or automatically installed on the EC2 instances before Myrna is run. Hadoop mode requires a functioning Hadoop cluster, with Bowtie, R and Bioconductor installed on all nodes. Singleton mode requires Bowtie, R and Bioconductor to be installed on the computer, but does not require Hadoop. Singleton mode is also parallelized and can exploit a user-specified number of processors.
Myrna is designed with the Apache Hadoop [33] open source implementation of the MapReduce [34] programming model in mind. The pipeline is expressed as a series of map and reduce stages operating on 'tuples' of data. A tuple is a key/value pair, roughly analogous to a row in a database table. A map stage takes a stream of input tuples, performs a computation and outputs a stream of tuples. A reduce stage takes a stream of bundles of 'alike' tuples, where tuples are alike if their primary keys are equal. The reduce stage then performs a computation and outputs a stream of tuples. Between the map and reduce phases, the infrastructure (Hadoop in the case of the Cloud or Hadoop modes, Myrna in the case of Singleton mode) automatically executes a sort/shuffle phase that bins and sorts tuples according to primary and secondary keys, respectively, and passes the sorted bins on to the reducers. Map and reduce stages must be simple and self-contained. They cannot communicate extensively or make heavy use of global data structures. This leaves Hadoop/Myrna with significant freedom in how it distributes parallel tasks across cluster nodes and/or processors.
Myrna workflow
Preprocess
Myrna's workflow is depicted in Figure 1. Each stage exploits a different type of parallelism with the aim of maximizing scalability. The first stage ('Preprocess') preprocesses a list of FASTQ files containing the input reads and installs the result on a filesystem visible to the cluster. Reads are also annotated with metadata, including the read's user-assigned sample name and the name of the file where it originated. This stage is parallel across input files, that is, files are downloaded and preprocessed simultaneously in parallel where possible.
Align
The second stage ('Align'; Figure 1a) aligns reads to a reference genome using Bowtie [24]. Bowtie employs a compact index of the reference sequence, requiring about 3 gigabytes of memory for the human genome. Each computer in the cluster independently obtains the index from a local or shared filesystem. When running on EC2, the index obtained here will typically be one of the pre-built indexes available publicly in S3. The user may specify options to be passed to Bowtie in this stage; the default is '-m 1', which discards alignments for reads that align multiple places. The alignment stage is parallel across reads; that is, reads are aligned simultaneously in parallel where possible.
Overlap
The third stage ('Overlap'; Figure 1b) calculates overlaps between alignments from the Align stage and a pre-defined collection of gene interval sets. In each instance where the 3'-most base of an alignment overlaps any base of a gene interval set, an overlap record associating the (labeled) alignment with the gene is output. By default, Myrna defines a gene interval set as the minimal set of intervals such that all contained bases are covered by all transcripts annotated for the gene. Intervals where two or more genes overlap are omitted from all gene interval sets. This is equivalent to the 'union intersection' model proposed previously [4]. Myrna allows the user to specify other models, such as the 'union' model whereby the interval set consists of the minimal set of intervals such that all contained bases are included in any exon annotation for the gene. Also, Myrna allows the user to specify which portion of the alignment to consider when overlapping with the gene interval set; for instance, instead of the 3'-most base the user can specify that the 5'-most five bases be used. The Overlap stage is parallel across alignments; that is, overlaps for distinct alignments are calculated simultaneously and in parallel where possible.
Normalize
The fourth stage ('Normalize'; Figure 1c) constructs a sorted vector of per-gene overlap counts for each label. A normalization factor is then calculated for each label - typically a quantile of the sample-specific gene count distribution. By default, Myrna sets the factor to the 75th percentile of the distribution of non-zero gene counts, as suggested previously [4]. Alternatively, the user may specify that Myrna use a different quantile or value, such as the median or total, as the normalization factor. The Normalize stage is parallel across labels.
Statistical analysis
The fifth stage ('Statistics'; Figure 1d) examines counts for each gene and calculates and outputs a P-value describing the probability that differences in counts observed between groups are due to chance. The Align and Overlap stages already calculated a count, c
ij
representing the number of times a read from sample j overlapped gene i. The differential expression test relates the counts to an outcome y
j
for the jth sample. The Normalization stage already calculated the 75th percentile, q
j
75, or another suitable summary of the count distribution for each sample.
The basic approach to differential expression is to fit a generalized linear model relating the counts c
ij
to the outcome y
j
:
where g(·) specifies a link function (identity for Normal models, log for Poisson models) and f(·) is a transformation of the raw count data (identity for Poisson models, log for Normal models). The functions s
k
(·) can be used to specify: (1) a continuous relationship between the counts and the outcome, by setting K = 1 and s
k
(·) to be the identify function; or (2) a factor model by setting K = # of groups and s
k
(·) = 1(y
j
= k). Myrna allows the user to specify either the Gaussian or Poisson family of distributions for the generalized linear model. The normalization term, log(q), can be included as an offset [4], in which case η
i
= 1 for all i. The default setting of Myrna is to use the 75th percentile of the count distribution for each sample as the normalization factor so q = q
j
75.
Myrna tests the hypotheses:
The hypothesis test can be performed using an asymptotic likelihood ratio test, or a permutation procedure. The permutation test is performed by first calculating the likelihood ratio statistic, D
i
, for testing H
0i
versus H
1i
for each gene. The outcome y
j
is randomly permuted B times; for each permutation the same procedure is applied to calculate null statistics D
i
0b, b = 1,...,B and i = 1,...,m where m is the total number of genes. Alternative statistics, like the trimmed mean statistic [9], can be implemented to try to address well known issues in RNA-Seq analysis, such as transcript length bias [27].
The Statistics stage is parallel across genes; that is, differential-expression P-values (both observed and null) for distinct genes are calculated simultaneously and in parallel where possible.
Summarize
The sixth stage ('Summarize') examines a sorted list of all P-values generated in the Statistics stage and compiles a list of the top N genes ranked by false discovery rate, where the parameter N is set by the user. In addition to the global significance results, more detailed statistical results and figures (see Postprocessing) are returned for the top N genes.
If a permutation test is used, the Summarize stage additionally calculates the permutation P-values. Permutation P-values are calculated as follows:
This is accomplished over the course of a single linear scan of the list of observed and null statistics, sorted by statistic. The parallel infrastructure (either Hadoop or Myrna) takes care of the sorting.
Though there is a modest amount of exploitable parallelism inherent in this task, Myrna performs the Summarize stage serially (on a single processor). The lack of parallelism is mitigated by the fact that there are typically only on the order of tens of thousands or hundreds of thousands of observed and null P-values to examine in this stage.
Postprocess
The seventh stage ('Postprocess') first discards all overlap records not belonging to any top genes, which it does in parallel across all overlaps. Next, Myrna calculates per-gene Q-values, a false discovery rate analog of P-values [35]. The user specifies N whereby the N genes with the smallest P-values are considered 'top' genes. Finally, Myrna outputs a series of output files, including: (a) files listing all overlaps for each top gene, including alignment information that might indicate the presence of sequence variants, such as single-nucleotide polymorphisms; (b) a table with estimated RPKM values for each gene in the annotation; (c) a sorted table of all P-values for all genes, along with a histogram plot; (d) a sorted table of all q-values for all genes; and (e) a series of plots showing the coverage for each of the top N genes, broken down by replicate and by group. These results are then compressed and stored in the user-specified output directory.
Some stages of the Myrna pipeline may be run separately. For instance, a user may wish to preprocess a set of input reads once, then re-analyze them several times, in which case the Preprocess phase need be run only once, and the Align through Post-process stages can be re-run for subsequent analyses.