CIDANE: comprehensive isoform discovery and abundance estimation

We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts efficiently with significantly higher sensitivity and precision than existing tools. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. CIDANE supports the integrated analysis of RNA-seq and additional gene-boundary data and recovers splice junctions that are invisible to other methods. CIDANE is available at http://ccb.jhu.edu/software/cidane/. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0865-0) contains supplementary material, which is available to authorized users.


Splicing graph
A splicing graph [1] represents connectivity properties (edges) of genomic intervals (nodes) inferred from aligned reads. For a given gene or locus let segments in S be numbered from 1 to |S| and ordered by their genomic position in 5 → 3 direction. We construct a splicing graph G S = (V, E) by adding a vertex v i ∈ V for every segment s i ∈ S and two additional start and end vertices v a and v b . Let f irst(s) and last(s) denote the first and last segment in the sequence of segmentss, respectively, and let denote the substring relation. Further, we denote a sequence of segments by " . . . ". Given the set of segment covers C, CIDANE adds a directed edge (v i , v j ) to E if i < j and one of the following holds: Figure S1: Splicing graph generated for the segment covers shown in Figure S8 in Additional file 2. The graph encodes four transcripts: t 1 = s 1 , s 2 , s 3 , s 4 , t 2 = s 1 , s 2 , s 3 , s 5 , t 3 = s 1 , s 3 , s 4 and t 4 = s 1 , s 3 , s 5 .
where threshold parameter γ (default: γ = 1) denotes a lower bound on the number of mapped reads required to support a splice junction. Set E contains an edge between neighboring segments that are not separated by an intron (condition (i)). Condition (ii) implies an edge if a sufficient number of reads was mapped across the associated splice junction. According to (iii), consecutive segments s i and s i+1 are connected by an edge if the associated splice junction is induced by enclosing paired-end reads.
Since candidate transcripts must start at a transcription start site and end at a transcription end site, we add an edge from v a to all segments in T SS and an edge from every segment in T ES to v b . Figure S1 depicts an example splicing graph for the segment covers shown in Figure S8 in Additional file 2. Candidate transcripts considered by CIDANE in Phase I then correspond to all paths from v a to v b in G S . From leftmost and rightmost starting positions (lpos 1 , rpos 1 ) fors and (lpos 2 , rpos 2 ) for s we can now infer t,c (f ): After defining ∆ in and ∆ out (see Figure S3) as: ∆ in := lpos 2 − rpos 1 and ∆ out := rpos 2 − lpos 1 , we finally obtain The length of cDNA fragments, however, is not uniform but typically modeled by a Normal distribution or as truncated Exponential [2]. The current implementation of CIDANE models the fragment length by a Normal distribution with density D, but can easily incorporate alternative fragment length distributions. Then, for a segment cover c and transcript t we compute the adjusted segment cover length t,c as follows: Figure S3: Illustration of values ∆ in and ∆ out for leftmost and rightmost starting positions (lpos 1 , rpos 1 ) and (lpos 2 , rpos 2 ).
whereˇ andˆ are the minimum and maximum considered fragment lengths, respectively, which limit the lower and upper 5%-quantile, respectively, of the estimated fragment length distribution.

Piecewise linear approximation of least squares
We approximate the quadratic objective (2) given in the main text by a piecwise linear function as follows. For every c i ∈ C we approximate g(x) = x 2 at supporting points p i j ∈ R, j = 1, . . . , k, by its tangent line: Due to its convexity, each of g p i 1 (x), . . . , g p i k (x) constitutes a lower bound on g(x). We select the tightest lower boundẽ i ≥ max 1≤j≤k g p i j (e i ), which yields the linear program formulation: The approximation error of g p i j (x) is given by (x − p i j ) 2 . Thus, if we want to bound the error of approximating (e i / max{ , b i }) 2 to µ > 0, it must hold for j = 2, . . . , k that In a predefined interval [a l , a r ] this requires k = (a r − a l )/(2 µ max{ , b i }) supporting points p 1 , . . . , p k , with p j = a l + (2j − 1) µ max{ , b i }.
Empirically we have chosen δ := 2 in a l = −δb i , a r = δb i , and µ := 0.001 in the current implementation of CIDANE. Parameters δ and µ can be easily adjusted to the quality of the data through corresponding command-line arguments.

Pricing for general covers
Similar to the graph introduced for the case of single exon spanning reads in Section 4.2 of the main text, we construct a hypergraph G = (V, E). For every segment cover (s i ,s i , b i ) ∈ C, however, a hyperedge e i ∈ E comprises all segments ins i ands i , i.e. e i = {v j ∈ V | s j ∈s i ∨ s j ∈s i }. Generalizing the definition in the main text,V (e i ) contains all vertices representing segments located between the last segment s l ofs i , and the first segment s f Again, edge weights w e : P(V (e)) → R model the summands on the left-hand side of equation (4) in the main text. Further, we define ξ(e i ) as the set of vertices with associated segments being spanned by eithers i ors i : where f 1 (f 2 ) and l 1 (l 2 ) are the first and last segments ofs i (s i ), respectively. Compared to Heaviest Isoform problem defined in the main text, we require an edge to be strictly induced by T ⊆ V in order to contribute to the total weight. An edge e is strictly induced by T ⊆ V , if the sequence of segments corresponding to vertices in T containss i ands i as substrings, i.e. ∀v ∈ e : v ∈ T ∧ ∀v ∈ ξ(e) \ e : v / ∈ T.
Using the same set of variables, only minor modifications to the ILP formulation given in the main text are necessary to capture the general Heaviest Isoform problem: w e,j y e,j The Heaviest Isoform Problem is NP-complete Proof. We devise a polynomial-time reduction from the NP-complete Independent Set (IS) problem. Given a graph G = (V, E) and an integer K, the IS problem asks whether there exists a subset of at least K vertices none of which are connected by an edge in E. From G we construct a weighted graph G by introducing, for every vertex v ∈ V , a duplicate vertex v and by connecting v and v by an edge e = (v, v ) of weight w e = 1. To every edge in E we assign weight −1. Then there exists an induced subgraph of G of weight at least K iff there exists an independent set in G of size at least K: Given an independent set I, the subgraph induced by the vertices in I and all their duplicates has weight |I|. On the other hand, let a subgraph of G of weight at least K be induced by a vertex set S. For every induced edge of weight −1 we arbitrarily remove one of its two vertices from S, inducing at most one edge of weight 1 less. The resulting set S induces a subgraph of weight at least the weight of the subgraph induced by S and does not contain any edges of weight −1. Thus, set V ∩ S is independent and contains at least K vertices.
we only consider elements inV (e) with which we obtain through an efficient splicing graph based backtracking scheme.