BreakTrans: uncovering the genomic architecture of gene fusions

Producing gene fusions through genomic structural rearrangements is a major mechanism for tumor evolution. Therefore, accurately detecting gene fusions and the originating rearrangements is of great importance for personalized cancer diagnosis and targeted therapy. We present a tool, BreakTrans, that systematically maps predicted gene fusions to structural rearrangements. Thus, BreakTrans not only validates both types of predictions, but also provides mechanistic interpretations. BreakTrans effectively validates known fusions and discovers novel events in a breast cancer cell line. Applying BreakTrans to 43 breast cancer samples in The Cancer Genome Atlas identifies 90 genomically validated gene fusions. BreakTrans is available at http://bioinformatics.mdanderson.org/main/BreakTrans

Many cancers are driven by pathogenic expression of mRNA fusion transcripts produced by genomic structural rearrangements (GSRs) in tumor cells. Classic examples include BCR-ABL1 in chronic myelogenous leukemia [1], PML-RARα in acute promyelocytic leukemia [2], and TMPRSS2-ERG in prostate cancer [3]. These fusions can arise not only from simple translocations of two distal genomic loci [4] but also from complex GSRs that involve multiple distal loci [5][6][7][8]. Accurately identifying these pathogenic transcripts and the originating GSRs will have a major impact in personalized cancer diagnosis and targeted therapy [4,9].
Since 2008, next generation sequencing (NGS) technologies have been applied to identify GSR breakpoints and gene fusions. Many bioinformatics tools such as BreakDancer [10], VariationHunter [11], and CREST [12] have been developed to detect GSRs from whole genome sequencing (WGS) data. These tools predict individual genomic breakpoints by searching for clusters of abnormally mapped reads. Although generally useful, they often produce an appreciable number of false positives and false negatives introduced by insufficient coverage, short insert size, misaligned reads, GC content bias, base calling errors, and repeats [13]. Limitations in data quality and the complexity of rearrangements make it challenging to infer the structure of complex GSRs (the so-called genome architecture) from predicted individual breakpoints [14,15]. Meanwhile, many tools such as Tophat-fusion [16], deFuse [17], MapSplice [18], and BreakFusion [19] have been developed to detect gene fusions from whole transcriptome sequencing (WTS) data. These tools are algorithmically similar to their genomic counterparts, although they place more emphasis on mapping and ascertaining novel sequence junctions produced by mRNA splicing and are more robust in modeling the coverage (expression). Again, these tools are associated with various types of false positives and false negatives [20] and often do not have good concordance.
When both WGS and WTS data are available, we can compare them to identify GSRs that lead to gene fusions. Because of the technical independence of these two data sources, their comparison can serve as a form of validation. In addition to improving results, mapping fusions to GSRs also elucidates the mechanistic origins of these fusions and their potential clinical values. However, such analysis is complicated by several factors. First, because of mRNA splicing, the genomic breakpoints responsible for a fusion may not be located near the fusion boundaries. Second, a fusion may be produced via multiple genomic breakpoints that join segments from distal regions of the genome. Several types of such complex GSRs have been recently revealed by WGS in various cancer types [5][6][7][21,22]. Third, not all GSRs produce new genes that can be transcribed. The properties (for example, location, type, and strand) of individual GSR breakpoints and the potential of producing valid open reading frames from existing genes need to be accounted for so as to produce biologically meaningful results. Fourth, current NGS data have limited power to accurately determine the genomic architectures of underlying alleles [23]. The technological limitations in resolving repeats and phase and the lack of physical coverage make it difficult to derive correct results.
To address these challenges, systematic approaches are in demand. Recently, two bioinformatics tools, Comrad [24] and nFuse [25], were developed for this purpose. Both tools align raw WGS and WTS reads while simultaneously corroborating fusions and GSRs. As an early effort, Comrad only maps a single fusion breakpoint to a single genomic breakpoint through the application of a set of ad hoc rules. As an update, nFuse maps fusion breakpoints to complex GSRs using a graph-theoretic approach. A design advantage of these tools is that they can account for ambiguous read alignment and therefore potentially minimize errors caused by misalignments. However, Comrad was only able to analyze low-pass WGS data that have limited power in discovering GSRs. Moreover, the self-contained design prevents them from examining hypotheses produced by other well-attested algorithms such as Tophat-fusion, MapSplice, BreakDancer and CREST.
To overcome these limitations, a modularly designed tool that focuses on mapping fusions to GSRs without re-performing breakpoint discovery may better serve the analytical demand and utilize existing resources. In this paper, we present such a bioinformatics tool, BreakTrans, that integrates the results of various fusion and GSR prediction algorithms and returns a set of genomically validated fusions with their originating alleles.

Overview of BreakTrans
BreakTrans is designed to map gene fusions predicted by a set of fusion prediction programs, such as deFuse, MapSplice, Tophat-fusion and BreakFusion, to GSR breakpoints predicted by a set of GSR prediction algorithms, such as BreakDancer, CREST, and VariationHunter (Figure 1; Materials and methods). BreakTrans includes four major steps: 1) parse and read in GSR and fusion breakpoints produced by front-end tools; 2) construct a genomic breakpoint graph from GSR breakpoints; 3) search for genomic alleles (paths in the breakpoint graph) that support fusion hypotheses; and 4) output validated fusions and associated genomic alleles.

Cell-line SK-BR-3
We applied BreakTrans to study the genome and transcriptome of SK-BR-3, a breast cancer cell line. We downloaded WTS data from the NCBI Sequence Read Archive [SRA:SRP003186]. We collected fusion breakpoints from three different sources. First, we analyzed the WTS data using Tophat-fusion-0.1.0 (beta) and obtained 27 fusion breakpoints. Second, we analyzed the data using BWA [26] and BreakDancer [10] with NCBI human assembly build 36 as the reference. From this analysis, we obtained 2,065 putative fusion breakpoints that contained 6 of the 10 known fusion genes in SK-BR-3 (Table 1) [27]. To further increase sensitivity, we included 28 Tophat-fusion breakpoints and 1,395 deFuse breakpoints that were previously published using the same set of WTS data [28]. This set included seven known fusion genes. Altogether, 3,498 unique fusion breakpoints were obtained (Additional file 1) that included 7 of 10 known fusion genes.
Figure 1 Schematic overview of BreakTrans. Plotted as an example are three genes, A, B and C, that range from genomic positions (black nodes) a to c, d to g, and h to j, respectively. Each gene contains two exons (arrow boxes) that can be transcribed from 5' to 3'. Gene A is on the positive (+) strand, while genes B and C are on the negative (-) strand. Two sets of putative novel genomic breakpoints are identified from alignments: b+
We ran BreakTrans-0.0.6 on these two sets of fusion and genomic breakpoints and obtained a set of 40 redundant fusion breakpoints that are supported by genomic alleles (Additional file 3). These fusion breakpoints are redundant (in location) due to our inclusion of multiple sources at variable nucleotide resolutions. Altogether, these 40 breakpoints nominated 8 unique fusion genes (Table 1), including 6 of the 10 known fusion genes and 2 novel ones.
Of the four known fusion genes that we missed, DHX35-ITCH and NFS1-PREX1 were likely due to insufficient coverage of the transcriptome, as indicated by a previous study [28]. CYTH1-EIF3H was due to insufficient coverage of the genome: neither BreakDancer nor CREST detected any genomic rearrangements that can be associated with this fusion. Although the WGS data we used have great sequence coverage (80-fold), their physical coverage is quite limited: the average insert size is only 211 bp with a read length of 100 bp. CSE1L-ENSG00000236127 has become obsolete because of the exclusion of ENSG00000236127 from the Ensembl database, as previously explained [28]. BreakTrans was able to validate all known fusions with sufficient coverage from this dataset, indicating its high sensitivity.
For comparison purposes, we ran nFuse-0.1.4 on the same WTS and WGS datasets using default parameters. Among the 1,994 predicted fusion breakpoints (Additional file 4), only 2 of the known fusion genes (ANKHD1-PCDH1 and SUMF1-LRRFIP2) were identified.
Eight gene fusions were predicted by BreakTrans-0.0.6, including 6 previously known fusions (1 to 6) and 2 novel ones (7 and 8). Four previously known fusions (9 to 12) were not predicted due primarily to lack of coverage. For each fusion, at least one underlying genomic allele is found, represented as a breakpoint path that consists of a series of breakpoint strings (Materials and methods). '>' represents the predicted (5' to 3') order of a gene fusion.
These eight fusion genes were supported by nine unique alleles, as shown by the breakpoint paths in Table 1. Six of the nine alleles contain one unique genomic breakpoint, representing the simplest way of generating a fusion. The allele that encodes PREX1-CPNE1 contains two breakpoints, which connect DNA segments from three different genes on chromosome 20. Included are the first three exons of PREX1, an intronic segment of PHF20, and the last three exons of CPNE1 (Figure 3a). These breakpoints are highly supported by WGS data: 17 soft-clipped reads were found at the PREX1-PHF20 breakpoint and 18 at the PHF20-CPNE1 breakpoint. The breakpoint path indicates an inverted duplication, a type of genomic rearrangement that has been commonly observed in breast cancer cell lines [21]. The WDR67-ZNF704 fusion was supported by two different alleles, containing three breakpoints and one breakpoint, respectively. These breakpoints also associate the boundaries of two distal amplicons on chromosome 8 (Figure 2c) with soft-clipped reads identified (Figures S10 and S11 in Additional file 5).
To validate these novel fusion breakpoints, we generated two independent paired-end RNA-seq datasets (SKBR3-1 and SKBR3-2; 76 bp read length) using the SK-BR-3 lines in our lab [SRA:SRP028176]. Both PREX1-CPNE1 and MTBP-SAMD12 were rediscovered at identical breakpoints using Tophat-fusion and BreakFusion, together with nine of the previously known fusions (Additional files 6 and 7). Note that both novel fusions were originally nominated using publicly available RNA-seq data (50 bp read length) [SRA:SRP003186] by deFuse [28], which employs alignment and fusion-calling algorithms very different from either Tophat-fusion or BreakFusion. Such independence in the data and in the analytical approaches supports both novel fusions predicted by BreakTrans as being real biological events. Interestingly, we also re-identified the genomic PREX1-PHF20 breakpoint in one of the RNA-seq datasets (SKBR3-2) (Additional file 7), which validated the existence of this breakpoint in the pre-mRNAs.
We further validated a set of associated genomic breakpoints using PCR (Additional file 8). If these genomic breakpoints were real, we should be able to observe PCR bands at expected DNA amplicon sizes. Out of the six breakpoints that we were able to design primers for, four amplicons had very clean bands (Figure 3b), which included both of the two breakpoints for PREX1-CPNE1, one for MTBP-SAMD12 and one for WDR67-ZNF704. Interestingly, two different PCR bands were observed at one of the WDR67-ZNF704 breakpoints (Figure 3b), consistent with our prediction that the WDR67-ZNF704 fusion is associated with two different genomic alleles.

BreakTrans analysis of The Cancer Genome Atlas breast cancer WGS and WTS datasets
We applied BreakTrans to 43 breast cancer samples with both WTS and WGS data from The Cancer Genome Atlas (TCGA; 12/30/2012) [34]. mRNA fusion breakpoints were nominated by BreakDancer by identifying clusters of read pairs that span different genes from the WTS BAM files produced by MapSplice [18]. The genomic breakpoints were detected by two programs, BreakDancer and SquareDancer (K Chen et al., unpublished), which examine discordant read pairs and soft-clipped reads, respectively. Together, we obtained a set of 156,955 redundant mRNA fusion breakpoints (an average of 3,650 per sample) and another set of 305,743 genomic breakpoints (an average of 6,794 per sample). We applied BreakTrans on these sets of breakpoints in conjunction with gene models specified in TCGA Genome Annotation Format (GAF) version 2.1, provided by the University of California Santa Cruz.
BreakTrans identified 177 redundant fusion breakpoints with convincing genomic evidence, which corresponded to 90 unique sample gene pairs (Additional file 9).
None of the fusions was found to be recurrent with identical gene pairs, suggesting a high level of heterogeneity in breast cancer as consistently demonstrated by previous studies [35]. However, we found a set of genes that recurrently partnered with others: CBX3, C15orf57, BCAS3, RARA, USP15, PTPRN2, USP32, FBXL20, SNX27, WIPF2, NF1 and RAD51C. Notably, the USP family members (USP13, USP15, and USP32) were frequently involved (in five fusions). Several fusions involved a kinase at the 3' end and are potentially viable therapeutic targets: USP13-PIK3CA, GPR160-PRKCI, and FBXL20-TLK2. Among the 43 samples, 33 were found to have more than 2 gene fusions, with one (A09I) containing 10 fusion genes. Sample A09I also demonstrated extensive genomic instability with many CNAs, including focal amplification of over 60-fold (Figure S12 in Additional file 5).
To prove the validity of BreakTrans predictions, we performed PCR validation on 20 genomic breakpoints (Additional file 10), including 9 that were associated with the above 3 multi-breakpoint fusions and 11 that we randomly selected from 9 samples. Out of these 20 breakpoints, 15 were validated as somatic, 1 as germline, and 4 as wild type (Additional file 11 and Figures S38 to S41 in Additional file 5). Further capillary sequencing of the PCR products confirmed the existence of one more breakpoint (Figure S42 in Additional file 5). Among the validated breakpoints were both of the two breakpoints underlying NF1-NLE1, all of the three breakpoints underlying PPP3R1-TTC27, and two of the three breakpoints underlying PPP1R1B-PIPOX.

Discussion
In this work, we present a novel bioinformatics approach, BreakTrans, that systematically maps detected gene fusions to novel genomic alleles produced by GSRs, thereby validating both sets of hypotheses and providing mechanistic interpretation to validated fusions. Our analysis and experimental validation indicated very high specificity of BreakTrans. The true specificity is likely higher than our estimation (60 to 80%), given the difficulties in performing PCR validation in repetitive regions.
Our results indicated that BreakTrans could achieve higher sensitivity through integration of multiple predictors without demonstrably increasing false positive rate. This is particularly important for current practice as individual predictors tend to be conservatively configured to achieve individually low false positive rates at the cost of increasing false negative rates. This phenomenon is particularly evident in our SK-BR-3 analysis, where we observed a large proportion of calls unique to a predictor. Conventional strategies that summarize results based on majority rules have been shown to be helpful in reducing false positives [36]. However, the further loss in sensitivity was usually not characterized. Applying BreakTrans to integrate multiple call sets is clearly a different and more effective strategy, as it integrates additional data. Indeed, the two novel fusions in the SK-BR-3 set were only nominated by deFuse and would have been eliminated if a simple consensus approach were taken. Our modular design allowed users to utilize their favorite predictors and include hypotheses from any source (for example, literature). This feature relieves users from trying to determine the best predictors and post-processing strategies for their data, a non-trivial task.
Another contribution of our work is that we proposed a convention (breakpoint string) to represent individual breakpoints and breakpoint graphs, as well as simple or complex alleles that encompass one or more breakpoints. This allows the reporting and communicating of large numbers of complex hypotheses in a concise and accurate way, an important requirement for large-scale sequencing and clinical sequencing efforts [37]. It also relieves researchers from manually piecing together alleles from individual breakpoints, a complex and error-prone task.
Our current version does not contain a scoring system to characterize the confidence of output fusions and alleles. This is mainly due to the complexity in integrating heterogeneous predictions from different sources, which are associated with heterogeneous scoring systems and precision. With this version in place, we are actively working on approaches to re-score breakpoints and alleles using a genotype-likelihood framework [13,36], which will be implemented in a future version of BreakTrans.
Although BreakTrans can effectively eliminate false breakpoints by leveraging the independence of WGS and WTS data and the existing knowledge of the human transcriptome, the quality of the results is clearly dependent on the quality of the input. If a large number of false breakpoints were included and true breakpoints excluded, any approach will have difficulty deriving correct answers. Improving breakpoint accuracy itself is a non-trivial task given the complexity of the cancer genome and the limitation of NGS [13]. Therefore, it is important to apply modular design that allows problems and efforts to be distributed. BreakTrans makes it possible to separate the problem of breakpoint integration from that of breakpoint identification. Further improvement in either area will synergistically improve the final results.
Similar to other programs, BreakTrans requires sufficient coverage on both genomic and transcriptomic breakpoints to validate an event. Failure to validate an event does not necessarily negate its existence. This is a fundamental problem in analyzing heterogeneous tumor samples that often contain multiple clones of tumor cells [38]; that is, subclonal breakpoints may not receive sufficient coverage from standard bulk sample sequencing. However, as NGS continues to evolve and its cost continues to fall, it becomes increasingly feasible to obtain deep coverage on both the genome and the transcriptome of subclonal cell populations [38] or even single cells [39].

Summary
We have developed a bioinformatics tool, BreakTrans, that systematically maps gene fusions to GSRs, an application that is important for molecular diagnosis and targeted therapy. Instead of re-performing breakpoint discovery, BreakTrans integrates breakpoint hypotheses from various sources using a novel breakpoint graph approach. Our examination using the WGS and WTS data from breast cancer cell-line SK-BR-3 indicates that BreakTrans has achieved higher sensitivity and specificity than existing approaches. Applying BreakTrans to the 43 breast cancer samples in TCGA, we have identified a set of 'genomically validated' gene fusions that are promising for further functional study. As sequencing coverage continues to increase, we anticipate wide application of BreakTrans in both research and clinical settings.

Representing genomic breakpoints
Existing GSR detection programs such as BreakDancer and CREST predict individual breakpoints from clusters of abnormally aligned paired-end reads or soft-clipped reads. Each breakpoint represents a joining of two DNA segments (break-ends) that are not adjacent in the reference genome. These breakpoints can be created by either simple genomic rearrangements, such as deletion, insertion, and duplication, or complex genome rearrangements, such as chromothripsis or close-chain translocation that creates multiple breakpoints [5][6][7][25]. The resulting relationship between the two break-ends in the subject genome is called a novel adjacency, as it does not exist in the reference genome. Such a breakpoint can be represented using a graphic representation known as a breakpoint graph [40]. Here, we define a breakpoint representation in the same vein, although it is more compact to use in our context. We define a 'breakpoint string' to specify exactly how two DNA break-ends are joined together at the breakpoint (Figure 4). A breakpoint string consists of two break-ends: an in-end and an out-end. The in-end represents the end point of a DNA segment before entering the breakpoint. The out-end represents the start point of another DNA segment after exiting the breakpoint. The ends are directional (double stranded). We use '+' to represent the positive strand and '-' to represent the negative strand. Each break-end is uniquely specified by a reference genomic coordinate x (consisting of a chromosome and a position) and a direction. We use a score f to quantify the confidence of the existence of the breakpoint. Popularly used scores include the number of reads or read pairs spanning the breakpoint or a genotype likelihood [41]. For notational convenience, we use a vertical bar '|' to represent the connection between an in-end and an out-end.
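The breakpoint string described above can be sketched as a small data structure. The following is a minimal illustration rather than the BreakTrans implementation: the textual layout `chrom:pos`, strand, score, `|`, `chrom:pos`, strand is an assumption chosen to mirror the x+f|y+ notation, the score is assumed to be an integer read count, and the names `BreakEnd`, `Breakpoint`, and `parse_breakpoint` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakEnd:
    """One end of a breakpoint: a reference coordinate plus a direction."""
    chrom: str
    pos: int
    strand: str  # '+' or '-'


@dataclass(frozen=True)
class Breakpoint:
    """An in-end joined to an out-end, with a confidence score f."""
    in_end: BreakEnd
    out_end: BreakEnd
    score: int  # assumed here to be a supporting read count

    def __str__(self) -> str:
        i, o = self.in_end, self.out_end
        return f"{i.chrom}:{i.pos}{i.strand}{self.score}|{o.chrom}:{o.pos}{o.strand}"


# Assumed layout: <chrom>:<pos><strand><score>|<chrom>:<pos><strand>
_PATTERN = re.compile(r"^(\w+):(\d+)([+-])(\d+)\|(\w+):(\d+)([+-])$")


def parse_breakpoint(text: str) -> Breakpoint:
    """Parse a breakpoint string written in the assumed layout."""
    match = _PATTERN.match(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a breakpoint string: {text!r}")
    c1, p1, s1, f, c2, p2, s2 = match.groups()
    return Breakpoint(BreakEnd(c1, int(p1), s1), BreakEnd(c2, int(p2), s2), int(f))


# Hypothetical coordinates, for illustration only.
bp = parse_breakpoint("20:1000+17|20:5000-")
print(bp.in_end, bp.out_end, bp.score)
```

A genotype likelihood could be stored instead of a read count by changing the score field to a float and relaxing the pattern accordingly.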
The definitions above allow us to specify breakpoints produced by various types of genomic structural rearrangements in a consistent and concise format (for example, x+f|y+). We further define four types of intrachromosomal rearrangement breakpoints: 'null', 'jump', 'inverse', and 'repeat' (Figure 4a-d). A 'null' breakpoint represents no breakpoint between x and y and the sequence between them is identical to the reference genome. We use a special score f = 0 to denote such a 'null' breakpoint. A 'jump' breakpoint joins together two non-adjacent segments on the same strand and skips the sequence between x and y. A breakpoint resulting from a deletion can be represented as a 'jump'. An 'inverse' breakpoint joins together two non-adjacent segments in opposite strand/orientation; it can represent breakpoints produced by inversions or inverted duplication. Finally, a 'repeat' breakpoint connects x back to an upstream position y on the same strand; it can represent breakpoints produced by tandem duplication. Similarly, we can use a breakpoint string to represent an interchromosomal breakpoint, resulting from four different ways of joining the break-ends (Figure 4e-h). Taken together, breakpoint strings defined by our rules can encode most, if not all, rearrangement breakpoints. Similar to DNA, a breakpoint string can be reverse-complemented by swapping the positions of x and y and flipping the orientations: that is, x+3|y+ is identical to y-3|x-, albeit on the opposite strand. This feature allows us to encode breakpoints as undirected edges while enabling strand-aware search.
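The breakpoint types and the reverse-complement rule can be illustrated with a short sketch. It is written under stated assumptions and is not taken from BreakTrans: the flat `Breakpoint` tuple and the functions `reverse_complement`, `classify`, and `canonical` are hypothetical names, and 'upstream' is interpreted as a lower coordinate on the positive strand and a higher coordinate on the negative strand.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    """Flat encoding of a breakpoint string x<strand><f>|y<strand>."""
    chrom_in: str
    pos_in: int
    strand_in: str   # '+' or '-'
    chrom_out: str
    pos_out: int
    strand_out: str  # '+' or '-'
    score: int       # f; 0 marks a 'null' (reference) breakpoint


_FLIP = {"+": "-", "-": "+"}


def reverse_complement(bp: Breakpoint) -> Breakpoint:
    """Swap the two break-ends and flip both strands (x+3|y+ -> y-3|x-)."""
    return Breakpoint(bp.chrom_out, bp.pos_out, _FLIP[bp.strand_out],
                      bp.chrom_in, bp.pos_in, _FLIP[bp.strand_in], bp.score)


def classify(bp: Breakpoint) -> str:
    """Name the breakpoint type following the definitions in the text."""
    if bp.chrom_in != bp.chrom_out:
        return "interchromosomal"
    if bp.score == 0:
        return "null"
    if bp.strand_in != bp.strand_out:
        return "inverse"
    # Same strand: does the out-end lie ahead of the in-end in reading direction?
    if bp.strand_in == "+":
        ahead = bp.pos_out > bp.pos_in
    else:
        ahead = bp.pos_out < bp.pos_in
    return "jump" if ahead else "repeat"


def canonical(bp: Breakpoint) -> Breakpoint:
    """Pick one of the two equivalent orientations so the edge is undirected."""
    return min(bp, reverse_complement(bp))


# Hypothetical coordinates, for illustration only.
deletion_like = Breakpoint("8", 100, "+", "8", 500, "+", 3)
print(classify(deletion_like), classify(reverse_complement(deletion_like)))
```

Because a breakpoint and its reverse complement receive the same type and the same canonical form, either orientation can be stored as a single undirected edge, while the strand symbols remain available for a strand-aware search.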

Constructing the breakpoint graph
All of the existing NGS structural variant detection software output breakpoints individually, representing aberrant adjacencies in the subject genome. We can connect these breakpoints together to form a breakpoint graph, in which a node represents a genomic position that either terminates or leads a break-end, and an edge represents a breakpoint. The edges are undirected and are specific to various types of breakpoints, as specified by the breakpoint strings. In a polyploid genome, multiple alleles (chromosomes) are present. A node can thereby have multiple edges, each representing a different allele. Where no aberrancy is detected, the subject genome is assumed to have the same allele as the reference genome.
To represent a complete genomic architecture, edges with null breakpoints are added to represent the reference alleles that connect the breakpoints. Note that our representation is different from those used by nFuse, in which a node represents a break-end on a specified strand. In our case, a node only represents a position; whether it leads or terminates a break-end on a specific strand depends on specifications on the connecting edges.
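The graph construction described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the BreakTrans code: nodes are `(chromosome, position)` pairs, edges carry a breakpoint string as a plain label, null edges are added between consecutive nodes on the same chromosome, and the function name `build_breakpoint_graph` and the label layout are hypothetical.

```python
from collections import defaultdict


def build_breakpoint_graph(breakpoints):
    """Build an undirected breakpoint graph.

    breakpoints: iterable of (node_a, node_b, label), where each node is a
    (chrom, pos) tuple and label is the breakpoint string for that edge.
    Returns a dict mapping node -> list of (neighbor, label).
    """
    graph = defaultdict(list)
    for node_a, node_b, label in breakpoints:
        # Undirected edge: record it from both endpoints.
        graph[node_a].append((node_b, label))
        graph[node_b].append((node_a, label))

    # Group node positions by chromosome so reference segments can be added.
    positions_by_chrom = defaultdict(set)
    for chrom, pos in list(graph):
        positions_by_chrom[chrom].add(pos)

    # Null edges (score 0) stand for the reference allele between neighbors.
    for chrom, positions in positions_by_chrom.items():
        ordered = sorted(positions)
        for left, right in zip(ordered, ordered[1:]):
            label = f"{chrom}:{left}+0|{chrom}:{right}+"
            graph[(chrom, left)].append(((chrom, right), label))
            graph[(chrom, right)].append(((chrom, left), label))
    return dict(graph)


# Hypothetical breakpoints, for illustration only.
edges = [
    (("20", 100), ("20", 900), "20:100+5|20:900+"),
    (("20", 500), ("8", 300), "20:500+4|8:300-"),
]
graph = build_breakpoint_graph(edges)
for node, neighbors in sorted(graph.items()):
    print(node, neighbors)
```

Keeping strand information on the edge label rather than on the node matches the design choice stated in the text: a node is only a position, and whether it leads or terminates a break-end is decided by the connecting edge.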
With a breakpoint graph constructed, the task of decoding chromosomal architecture involves identifying paths that start at the beginning and extend to the end of the chromosome. Identifying correct paths in a graph that contains many nodes and edges is clearly a computationally challenging problem.

Transcriptome-guided search
To achieve accuracy and efficiency, it is desirable to simplify the graph. Rather than trying to decode the complete genome (global optimization), we can focus on expressed regions (local optimization). We ignore read-through events, which are out of our current scope, by disconnecting the reference allele (null edges) between the end nodes of neighboring genes. We can always restore these connections if read-through events are of interest.
Similar to a genomic breakpoint, a fusion (transcriptome) breakpoint predicted from mapping RNA-seq reads to the reference genome can be specified by two genomic positions x and y that are located in two different genes. To determine the underlying allele from which the fusion is transcribed, we first identify the nearest genomic breakpoints ($x_0$, $y_0$) downstream of x and y in the breakpoint graph. We then start at $x_0$ and perform a recursive breadth-first search: $p(x_0) = x_0 + p(n(x_0))$, where the function $p(x)$ denotes the alleles starting at x, $n(x)$ denotes the set of nodes that connect to x, and + represents path extension. A path terminates if it hits either node $y_0$ or the end of a gene. This search algorithm returns all genomic alleles (or breakpoint paths) in the breakpoint graph that support a fusion hypothesis.
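The recurrence above can be sketched as a recursive path enumeration. This is a simplified illustration under several assumptions, not the BreakTrans algorithm itself: strands are ignored, 'downstream' is taken to mean a higher coordinate, visited nodes are tracked and a `max_edges` cap is added to guarantee termination, and the end of a gene is modeled as a node with no outgoing edges. The recursion visits neighbors depth-first, which yields the same set of paths as a level-by-level expansion. The names `nearest_downstream` and `find_alleles` are hypothetical.

```python
from bisect import bisect_left


def nearest_downstream(sorted_positions, x):
    """Return the first breakpoint position >= x, or None if there is none."""
    index = bisect_left(sorted_positions, x)
    return sorted_positions[index] if index < len(sorted_positions) else None


def find_alleles(graph, start, target, max_edges=10):
    """Enumerate paths from start to target, following p(x) = x + p(n(x)).

    graph: dict mapping node -> list of neighboring nodes.
    A path ends when it reaches target, runs out of unvisited neighbors
    (for example at a gene end), or would exceed max_edges edges.
    """
    paths = []

    def extend(node, path, visited):
        if node == target:
            paths.append(path)
            return
        if len(path) - 1 >= max_edges:
            return
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                extend(neighbor, path + [neighbor], visited | {neighbor})

    extend(start, [start], {start})
    return paths


# Toy graph with placeholder node names: one allele reaches y0 through two
# intermediate nodes, another reaches it directly, and a third branch stops
# at a gene end without supporting the fusion.
toy_graph = {
    "x0": ["a", "y0", "gene_end"],
    "a": ["b"],
    "b": ["y0"],
    "y0": [],
    "gene_end": [],
}
for allele in find_alleles(toy_graph, "x0", "y0"):
    print(" > ".join(allele))
```

In this toy example two distinct paths support the same fusion hypothesis, which is the situation reported for a fusion backed by more than one genomic allele.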