The computational problem: rearrangement of genome segments
We formulate the TSV detection problem as the optimization problem of rearranging genome segments to maximize the number of observed reads that are consistent (termed concordant) with the rearranged genome. This approach requires defining the genome segments that can be independently rearranged. It also requires defining which reads are consistent with a particular arrangement of the segments. We will encode both of these (segments and read consistency) within a GSG. Fig. 6 is an example.
Definition 1
(Segment) A segment is a pair s=(s
h
,s
t
), where s represents a continuous sequence in the reference genome. s
h
represents its head and s
t
its tail in reference genome coordinates. In practice, segments will be derived from read locations.
Definition 2
(Genome Segment Graph (GSG)) A GSG G=(V,E,w) is an undirected weighted graph, where V contains both endpoints of each segment in a set of segments S, i.e., V={s
h
:s∈S}∪{s
t
:s∈S}. Thus, each vertex in the GSG represents a location in the genome. An edge (u,v)∈E indicates that there is evidence that the location u is adjacent to location v. The weight function \(w: E \longrightarrow \mathbb {R}^{+}\) represents the reliability of an edge. Generally speaking, the weight is the number of read alignments supporting the edge, but we use a multiplier to calculate edge weights, which is discussed below. In practice, E and w will be derived from split-aligned and paired-end reads.
Defining vertices by endpoints of segments is required to avoid ambiguity. Knowing only that segment i is connected with segment j is not enough to recover the sequence, since different relative positions of i and j spell out different sequences. Instead, for example, an edge (i
t
,j
h
) indicates that the tail of segment i is connected to the head of segment j, and this specifies a unique desired local sequence with the only other possibility being the reverse complement (i.e., it could be that the true sequence is i·j or rev(j)· rev(i); here · indicates concatenation and rev(i) is the reverse complement of segment i).
A GSG is similar to a breakpoint graph [51] but with critical differences. A breakpoint graph has edges representing connections both in the reference genome and in the target genome. Edges in a GSG represent only the target genome, and they can be either concordant or discordant. In addition, a GSG does not require that the degree of every vertex is 2, and thus, alternative splicing and erroneous edges can exist in a GSG.
Our goal is to reorder and reorient the segments in S so that as many edges in G are compatible with the rearranged genome as possible.
Definition 3
(Permutation) A permutation π on a set of segments S projects a segment in S to a set of integers from 1 to |S| (the size of S), representing the indices of the segments in an ordering of S. In other words, each permutation π defines a new order of segments in S.
Definition 4
(Orientation Function) An orientation function f maps both ends of a segment to 0 or 1:
$$ f:\{s_{h}: s\in S\}\cup\{s_{t}: s\in S\} \longrightarrow \{0,1\}, $$
subject to f(s
h
)+f(s
t
)=1 for all s=(s
h
,s
t
)∈S. An orientation function specifies the orientations of all segments in S. Specifically, f(s
h
)=1 means s
h
goes first and s
t
next, corresponding to the forward strand of the segment, and f(s
t
)=1 corresponds to the reverse strand of the segment.
With a permutation π and an orientation function f, the exact and unique sequence of a genome is determined. The reference genome also corresponds to a permutation and an orientation function, where the permutation is the identity permutation and the orientation function maps all s
h
to 1 and all s
t
to 0.
Definition 5
(Edge Compatibility) Given a set of segments S, a GSG G=(V,E,w), a permutation π on S, and an orientation function f, an edge e=(u
i
,v
j
)∈E, where u
i
∈{u
h
,u
t
} and v
j
∈{v
h
,v
t
}, is compatible with permutation π and orientation f if and only if
$$ 1-f(v_{j})=\mathbf{1}\left[\pi(v)<\pi(u)\right]=f(u_{i}), $$
(1)
where 1[x] is an indicator function, which is 1 if x is true and 0 otherwise. Comparison between permuted elements is defined as comparing their index in permutation, that is, π(v)<π(u) states that segment v is in front of segment u in rearrangement π. We write e∼(π,f) if e is compatible with π and f.
The above two edge compatibility Eqs. 1 require that, for an edge to be compatible with the rearranged and reoriented sequence determined by π and f, it needs to connect the right side of the segment in front to the left side of the segment following it. As we will see below, the edges of a GSG are derived from read alignments. An edge being compatible with π and f is essentially equivalent to stating that the corresponding read alignments are concordant with respect to the target genome determined by π and f. When (π,f) is clear, we refer to edges that are compatible as concordant edges and edges that are incompatible as discordant edges.
With the above definitions, we formulate an optimization problem as follows:
Problem 1
Input: A set of segments S and a GSG G=(V,E,w).Output: Permutation π on S and orientation function f that maximizes:
$$ \max_{\pi, f} \sum_{e\in E} w(e) \cdot \mathbf{1}\left[e\sim(\pi,f)\right]. $$
(2)
This objective function tries to find a rearrangement of genome segments (π,f) such that when aligning reads to the rearranged sequence, as many reads as possible will be aligned concordantly. This objective function includes both concordant alignments and discordant alignments and sets them in competition, which is effective in reducing the number of false positives when tumor transcripts outnumber normal transcripts. There is the possibility that some rearranged tumor transcripts will be outnumbered by normal counterparts. To be able to detect TSVs in this case, depending on the setting, we may weight discordant read alignments more than concordant read alignments. Specifically, for each discordant edge e, we multiply the weight w(e) by a constant α that represents our estimate of the ratio of normal transcripts over tumor counterparts.
The final TSVs are modeled as pairs of breakpoints. Denote the permutation and orientation corresponding to an optimally rearranged genome as (π∗,f∗) and those that correspond to the reference genome as (π0,f0). An edge e can be predicted as a TSV if e∼(π∗,f∗) and \(e\nsim (\pi _{0},f_{0})\).
Integer linear programming formulation
We use ILP to compute an optimal solution (π∗,f∗) of Problem 1. To do this, we introduce the following Boolean variables:
-
x
e
=1 if edge e∼(π∗,f∗) and 0 if not.
-
z
uv
=1 if segment u is before v in the permutation π∗ and 0 otherwise.
-
y
u
=1 if f∗(u
h
)=1 for segment u.
With this representation, the objective function can be rewritten as
$$ \max_{x_{e}, y_{u}, z_{uv}} w(e) \cdot x_{e}. $$
(3)
We add constraints to the ILP derived from edge compatibility Eq. 1. Without loss of generality, we first suppose segment u is in front of v in the reference genome, and edge e connects u
t
and v
h
(which is a tail–head connection). Plugging in u
t
, the first equation in (1) is equivalent to 1−1[π(u)>π(v)]=1−f(u
t
) and can be rewritten as 1[π(u)<π(v)]=f(u
h
)=y
u
. Note that 1[π(u)<π(v)] has the same meaning as z
uv
; it leads to the constraint z
uv
=y
u
. Similarly, the second equation in (1) indicates z
uv
=y
v
. Therefore, x
e
can reach 1 only when y
u
=y
v
=z
uv
. This is equivalent to the inequalities (4) below. Analogously, we can write constraints for the other three types of edge connections: tail–tail connections impose inequalities (5), head–head connections impose inequalities (6), and head–tail connections impose inequalities (7):
$$ \begin{aligned} x_{e} &\leq y_{u}-y_{v}+1, \\ x_{e} &\leq y_{v}-y_{u}+1, \\ x_{e} &\leq y_{u}-z_{uv}+1, \\ x_{e} &\leq z_{uv}-y_{u}+1, \end{aligned} $$
(4)
$$ \begin{aligned} x_{e} &\leq y_{u}-(1-y_{v})+1, \\ x_{e} &\leq (1-y_{v})-y_{u}+1, \\ x_{e} &\leq y_{u}-z_{uv}+1, \\ x_{e} &\leq z_{uv}-y_{u}+1, \end{aligned} $$
(5)
$$ \begin{aligned} x_{e} &\leq (1-y_{u})-y_{v}+1, \\ x_{e} &\leq y_{v}-(1-y_{u})+1, \\ x_{e} &\leq (1-y_{u})-z_{uv}+1, \\ x_{e} &\leq z_{uv}-(1-y_{u})+1, \end{aligned} $$
(6)
$$ \begin{aligned} x_{e} &\leq (1-y_{u})-(1-y_{v})+1, \\ x_{e} &\leq (1-y_{v})-(1-y_{u})+1, \\ x_{e} &\leq (1-y_{u})-z_{uv}+1, \\ x_{e} &\leq z_{uv}-(1-y_{u})+1. \end{aligned} $$
(7)
We also add constraints to enforce that z
uv
forms a valid topological ordering. For each pair of nodes u and v, one must be in front of the other, that is z
uv
+z
vu
=1. In addition, for each triple of nodes, u, v, and w, one must be at the beginning and one must be at the end. Therefore, we add 1≤z
uv
+z
vw
+z
wu
≤2.
Solving an ILP in theory takes exponential time, but in practice, solving the above ILP to rearrange genome segments is very efficient. The key is that we can solve for each connected component separately. Because the objective maximizes the sum of compatible edge weights, the best rearrangement of one connected component is independent of the rearrangement of another because, by definition, there are no edges between connected components.
Concordant and discordant alignments
Discordant alignments are alignments of reads that contradict the library preparation in sequencing. Concordant alignments are alignments of reads that agree with the library preparation. Take Illumina sequencing as an example. For a paired-end read alignment to be concordant, one end should be aligned to the forward strand and the other to the reverse strand, and the forward strand aligning position should be in front of the reverse strand aligning position (Fig. 7a). Concordant alignment traditionally used in WGS also requires that a read cannot be split and aligned to different locations. However, these requirements are invalid in RNA-seq alignments because alignments of reads can be separated by an intron with unknown length.
We define concordance criteria separately for split-alignment and paired-end alignment. If one end of a paired-end read is split into several parts and each part is aligned to a location, the end has split alignments. Denote the vector of the split alignments of an end as R=[A1,A2,…,A
r
] (r depends on the number of splits). Each alignment R[i]=A
i
has four components: a chromosome (Chr), an alignment starting position (Spos), an alignment ending position, and an orientation (Ori, with value either + or −). We require that the alignments A
i
are sorted by their position in the read. A split-aligned end R=[A1,A2,…,A
r
] is concordant if all the following conditions hold:
$$ \begin{aligned} A_{i}.\text{Chr} &= A_{j}.\text{Chr}, && \text{\(\forall i, \forall j\)}, \\ A_{i}.\text{Ori} &= A_{j}.\text{Ori}, && \text{\(\forall i, \forall j\)}, \\ A_{i}.\text{Spos} &< A_{j}.\text{Spos}, && \text{if \(A_{i}.\text{Ori}=+\) for all \(i<j\)}, \\ A_{i}.\text{Spos} &>A_{j}.\text{Spos}, && \text{if \(A_{i}.\text{Ori}=-\) for all \(i<j\)}. \end{aligned} $$
(8)
If the end is not split, but continuously aligned, the alignment automatically satisfies Eq. 8. Denote the alignments of R’s mate as M=[B1,B2,…,B
m
]. An alignment of the paired-end read is concordant if the following conditions all hold:
$$ \begin{aligned} A_{i}.\text{Chr} &= B_{j}.\text{Chr}, && \text{\(\forall i, \forall j\)}, \\ A_{i}.\text{Ori} &\neq B_{j}.\text{Ori}, && \text{\(\forall i, \forall j\)}, \\ A_{1}.\text{Spos} &< B_{m}.\text{Spos}, && \text{if \(A_{1}.\text{Ori}=+\)}, \\ A_{m}.\text{Spos} &> B_{1}.\text{Spos}, && \text{if \(A_{1}.\text{Ori}=-\)}. \end{aligned} $$
(9)
We require only that the leftmost split of the forward read R is in front of the leftmost split of the reverse read M, since the two ends in a read pair may overlap. For a paired-end read to be concordant, each end should satisfy split-read alignment concordance (8), and the pair should satisfy paired-end alignment concordance (9).
Splitting the genome into segments S
We use a set of breakpoints to partition the genome. The set of breakpoints contains two types of positions: (1) the start position and end position of each interval of overlapping discordant alignments and (2) an arbitrary position in each 0-coverage region.
Ideally, both ends of a discordant read should be in separate segments, otherwise, a discordant read in a single segment will always be discordant, no matter how the segments are rearranged. Assuming discordant read alignments of each TSV pile up around the breakpoints and do not overlap with the discordant alignments of other TSVs, we set a breakpoint on the start and end positions of each contiguous interval of overlapping discordant alignments.
Each segment that contains discordant read alignments may also contain concordant alignments that connect the segment to its adjacent segments. To avoid having all segments in a GSG connected to their adjacent segments and thus, creating one big connected component, we pick the starting point of each 0-coverage region as a breakpoint. By adding those breakpoints, different genes will be in separate connected components unless some discordant reads support their connection. Overall, the size of each connected component is not very large. The number of nodes generated by each gene is approximately the number of exons located in them and these gene subgraphs are connected only when there is a potential TSV between them.
Defining edges and filtering out obvious false positives
In a GSG, an edge is added between two vertices when there are reads supporting the connection. For each read spanning different segments, we build an edge such that when traversing the segments along the edge, the read is concordant with the new sequence [Eqs. (8) and (9)]. Examples of deriving an edge from a read alignment are given in Fig. 7. In this way, the concordance of an alignment and the compatibility of an edge with respect to a genome sequence are equivalent.
The weight of a concordant edge is the number of read alignments supporting the connection, while the weight of a discordant edge is the number of supporting alignments multiplied by the discordant edge weight coefficient α. The discordant edge weight coefficient α represents the normal/tumor cell ratio (for a complete table of SQUID parameters, see Additional file 1: Table S1). If normal transcripts dominate tumor transcripts, α enlarges the discordant edge weights and helps to satisfy the discordant edges in the rearrangement of the ILP.
We filter out obvious false positive edges to reduce both the ILP computation time and the mistakes after the ILP. Edges with very low read support are likely to be a result of alignment error, therefore, we filter out edges with a weight lower than a threshold θ. Segments with too many connections to other regions are likely to have low mappability, so we also filter out segments connecting to more than γ other segments. The parameters α, θ, and γ are the most important user-defined parameters in SQUID (Additional file 1: Table S1, Figures S2 and S3). An interleaving structure of exons from different regions (different genes) seems more likely to be a result of sequencing or alignment error rather than an SV. Thus, we filter out the interleaving edges between two such groups of segments.
Identifying TSV breakpoint locations
Edges that are discordant in the reference genome indicate potential rearrangements in transcripts. Among those edges, some are compatible with the permutation and orientation from ILP. These edges are taken to be the predicted TSVs. For each edge that is discordant initially but compatible with the optimal rearrangement found by ILP, we examine the discordant read alignments to determine the exact breakpoint within related segments. Specifically, for each end of a discordant alignment, if there are two other read alignments that start or end in the same position and support the same edge, then the end of the discordant alignment is predicted to be the exact TSV breakpoint. Otherwise, the boundary of the corresponding segment will be output as the exact TSV breakpoint.
Simulation methodology
Simulations with randomly added SVs and simulated RNA-seq reads were used to evaluate SQUID’s performance in situations with a known correct answer. RSVsim [52] was used to simulate SVs in the human genome (Ensembl 87 or hg38) [53]. We use the five longest chromosomes for simulation (chromosome 1 to chromosome 5). RSVsim introduces five different types of SVs: deletion, inversion, insertion, duplication, and inter-chromosomal translocation. To vary the complexity of the resulting inference problem, we simulated genomes with 200 SVs of each type, 500 SVs of each type, and 800 SVs of each type. We generated four replicates for each level of SV complexity (200, 500, and 800). For inter-chromosomal translocations, we simulate only two events because only five chromosomes were used.
In the simulated genome with SVs, the original gene annotations are not applicable, and we cannot simulate gene expression from the rearranged genome. Therefore, for testing, we interchange the roles of the reference (hg38) and the rearranged genome, and use the new genome as the reference genome for alignment, and hg38 with the original annotated gene positions as the target genome for sequencing. Flux Simulator [54] was used to simulate RNA-seq reads from the hg38 genome using Ensembl annotation version 87 [55].
After simulating SVs in the genome, we need to transform the SVs into a set of TSVs, because not all SVs affect the transcriptome, and thus, not all SVs can be detected by RNA-seq. To derive a list of TSVs, we compare the positions of simulated SVs with the gene annotation. If a gene is affected by an SV, some adjacent nucleotides in the corresponding transcript may be in a far part of the RSVsim-generated genome. The adjacent nucleotides may be consecutive nucleotides inside an exon if the breakpoint breaks the exon, or the endpoints of two adjacent exons if the breakpoint hits the intron. So for each SV that hits a gene, we find the pair of nucleotides that are adjacent in the transcript and separated by the breakpoints, and convert them into the coordinates of the RSVsim-generated genome, thus, deriving the TSV.
We compare SQUID to the pipeline of de novo transcriptome assembly and transcript-to-genome alignment. We also use the same set of simulations to test whether existing WGS-based SV detection methods can be directly applied to RNA-seq data. For the de novo transcriptome assembly and transcript-to-genome alignment pipeline, we use all combinations of the existing software Trinity [23], Trans-ABySS [22], GMAP [27], and MUMmer3 [26]. For WGS-based SV detection methods, we test LUMPY [7] and DELLY2 [6]. We test both STAR [56] and SpeedSeq [33] (which is based on BWA-MEM [57]) to align RNA-seq reads to the genome. LUMPY is compatible only with the output of SpeedSeq, so we do not test it with STAR alignments.