SQUID: transcriptomic structural variation detection from RNA-seq

Transcripts are frequently modified by structural variations, which lead to fused transcripts of either multiple genes, known as a fusion gene, or a gene and a previously non-transcribed sequence. Detecting these modifications, called transcriptomic structural variations (TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs accurately from RNA-seq alignments. SQUID unifies both concordant and discordant read alignments into one model and doubles the precision on simulation data compared to other approaches. Using SQUID, we identify novel non-fusion-gene TSVs on TCGA samples. Electronic supplementary material The online version of this article (10.1186/s13059-018-1421-5) contains supplementary material, which is available to authorized users.

For the pipeline of de novo transcriptome assembly and transcript-to-genome alignment, the direct output is a series of alignment pieces for each assembled transcript. To derive TSV from the pieces of alignment of each transcript, we still need to use the split-read alignment concordance criteria (8) and the edge-building approach. In the case of no TSV, equation (8) still holds, since a transcript is generated from one strand of one chromosome, without rearrangements but only deletion of introns. Any violation of (8) is treated as a TSV. Here TSVs are still able to be represented by edges in GSG, where segments are the intervals of each piece of alignment, and edges are added in the same principle that traversing segments along the edges will result in a concordant alignment of the assembled transcript. The positions of both breakpoints in a TSV are exactly the two positions linked by the discordant edge, and the orientations corresponds to the connection type of the edge.

Processing TCGA RNA-seq data
We use STAR aligner [56] to align TCGA RNA-seq reads to Ensemble genome 87 [53] with the corresponding gene annotation. STAR aligner [56] is set with the option of outputting chimeric alignments with hanging length 15bp. The chimeric alignments generated by STAR [56] are further filtered out if the paired-end reads can be aligned concordantly by SpeedSeq aligner [33]. SQUID is applied to concordant alignments generated by STAR [56] and the filtered chimeric alignments. The discordant edge weight coefficient α is set to be 1, that is, we require tumor transcripts to dominate normal transcripts (if they are incompatible) in order to predict corresponding TSVs.
A large number of fusions between immunoglobulin genes are predicted by SQUID. However, there is possibility that B cells are in the mixture of sequencing and have very high expression of immunoglobulin genes (Ig). We cannot tell whether Ig rearrangements are generated by tumor cells or B cells. Therefore, we exclude Ig TSVs during post-processing and exclude them from the descriptive statistics. Note that SQUID does not exclude Ig TSVs internally, because Ig expression and VDJ recombination have been observed to exist in tumor cells, and revealing the role of Ig in tumors may be useful. When normal cells are removed from tumor samples, using SQUID to predict Ig TSVs may help study relationship between Ig and cancer. low Phred quality threshold 4 (p = 10 −0.4 ) l maximum allowed low Phred quality length 10 Note: mq, pq and l are controls for sequencing quality and mapping quality. If mapping quality of a read is less then threshold mq, the read will not be used in edge building. If the read has a low sequencing, in terms of having more than l bases of sequencing quality lower than pq, the read will not be used in edge building.   Figure S5: IGV visualization of non-fusion-gene TSV involving ZFHX3 gene. The reference sequence showed by IGV is the junction sequence of TSV. The first track shows the exons of ZFHX3 gene in the junction sequence. The second track shows the boundaries of the fused genome segments. In the alignment track, read alignments are viewed as pairs (the grey line links two paired-end alignments). Coverage of intergenic segment is less than coverage of ZFHX3 gene, which indicates the TSV is heterogeneous and appears in a portion of sequencing sample.

MYLK3 ZFHX3
MYLK3_ZFHX3_converted.gtf geneboundary.bed   Figure S7: IGV visualization of non-fusion-gene TSV involving ASXL1 gene. The reference sequence is the junction sequence of TSV. The first track shows the exons of ASXL1 gene in the junction sequence. The second track shows the boundaries of the fused genome segments. In the alignment track, read alignments are viewed as pairs (the grey line links two paired-end alignments). There are many reads spanning the junction point, and the coverage difference between the two segments is small, which implies the TSV is possibly homogeneous.  Figure S8: IGV visualization of fusion-gene TSV involving ASXL1 and PDRG1 genes. The reference sequence is the junction sequence of TSV. The first two annotation tracks show the exons of ASXL1 and PDRG1 gene in the junction sequence. The third track shows the boundaries of the fused genome segments.
In the alignment track, read alignments are viewed as pairs (the grey line links two paired-end alignments).
There are 8 reads spanning the junction point. The coverage of the ASXL1 gene is much less than that of the PDRG1 gene, which implies the fusion-gene TSV is heterogeneous.