Graph-guided assembly for novel HLA allele discovery

Accurate typing of human leukocyte antigen (HLA), a histocompatibility test, is important because HLA genes play various roles in immune responses, and have also been shown to be associated with many diseases such as cancer. The current gold standard for HLA typing uses DNA sequencing technology combined with sequence enrichment techniques using specially designed primers or probes, causing it to be slow and labor-intensive. Although there exist enrichment-free computational methods that use various types of sequencing data, hyper-polymorphism found in HLA region of the human genome makes it challenging to type HLA genes with high accuracy from whole genome sequencing data. Furthermore, these methods are database-matching approaches where their output is inherently limited by the completeness of already known types, forcing them to find the best matching known alleles from a database, thereby causing them to be unsuitable for discovery of rare or novel alleles. In order to ensure both high accuracy as well as the ability to type novel alleles, we have developed a graph-guided assembly technique for classical HLA genes, which is capable of assembling phased, full-length haplotype sequences of typing exons given high-coverage (>30-fold) whole genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art database-matching methods. We also demonstrate that our method can type novel alleles by experimenting on various data including simulated, Illumina Platinum Genomes, and 1000 Genomes data.


Introduction 22
Human leukocyte antigen (HLA) genes are crucial to the regulation of immune system as they encode for 23 the major histocompatibility complex (MHC) consisting of cell surface proteins that control the adaptive 24 immune response. HLA genes are also known to play important roles in transplant rejection, autoimmune 25 disorders [1] and cancer [2,3]. For these reasons, accurate HLA typing is important both in clinical and 26 research settings. HLA typing is considered challenging because of the hyper-polymorphic nature of the 27 HLA region in human genome. Such high polymorphism in the HLA region is thought to be maintained 28 by strong balancing selection promoting genetic diversity [4,5]. Especially with personal genome sequenc-29 ing becoming widely common, better computational methods are needed to provide rapid and inexpensive 30 typing with high accuracy. 31 Traditionally, HLA typing or categorization was done by more laborious serology-based methods to screen 32 for HLA antibodies in a donor/receiver pair. With the birth of DNA sequencing and polymerase chain reac-33 tion (PCR), molecular typing assays such as specific oligonucleotide probe hybridization (SSOP), sequence-34 specific primer amplification (SSP), and sequence-based typing (SBT) have been developed [6]. The SBT 35 method can be either used with Sanger sequencing or NGS techniques. By using specific primers to perform 36 target enrichment prior to sequencing, SBT delivers accurate and reliable typing of HLA alleles. However, 37 all of the above molecular typing assays require a specially designed set of probes/primers, and they are 38 labor intensive, low throughput, and costly. 39 With an increasing availability of personal WGS services, an availability of accurate computational HLA settings. More than 15,000 known alleles (just for these classical genes) have been reported in the database 45 and the number of alleles is growing rapidly (Figure 1). Also, the known alleles share high sequence 46 similarities, where many alleles just differ by a base-pair substitution. Thus, it is challenging to correctly 47 pinpoint an individual's HLA types among the known alleles using WGS data [8]. 48 Previously developed enrichment-free computational methods can use whole genome sequencing (WGS), 49 whole exome sequencing (WES), or transcriptome sequencing (RNA-seq) without the use of HLA-enriched For these reasons, it is important to be able to recover HLA sequences at 1-bp resolution to enable novel 79 allele discovery as similarly done in SBT. In order to achieve this goal, we present a graph-guided assembly 80 technique called Kourami that constructs full sequences for the peptide binding domain (exons 2 and 3 for 81 class I and exon 2 for class II HLA genes-regions typed by the SBT methods) by using a modified partial 82 ordered graph (POG) [14] as a guide. Our method is the first method that directly assemble both haplotypes 83 of HLA genes rather than to infer the best matching alleles in the database. For known alleles, we show that 84 Kourami can correctly type with high accuracy (>98%), equalling that of the state-of-the-art database-85 matching method, across various WGS datasets such as simulated data, Illumina Platinum Genomes, and 86 high coverage WGS from 1000 Genomes project. At the same time, Kourami only takes a fraction of time 87 compared to other available methods with a moderate use of memory. 88 In addition, Kourami is the first HLA typer to be able to assemble novel alleles that do not appear in the 89 database. It does this by treating the HLA typing problem as an instance of graph-guided assembly, where 90 the known alleles are combined into a graph that is used to guide the assembly of new alleles. Kourami 91 therefore also represents an early example of how a population of reference sequences can be used during 92 genome assembly. We systematically show that Kourami is very accurate in its ability to construct novel 93 alleles by performing leave-one-out experiments where a known allele is artificially removed from the allele 94 database. Kourami is able to reconstruct 98% of these alleles perfectly.

96
HLA typing nomenclature 97 The current HLA allele nomenclature [15] uses a hierarchical numbering system with 4 major levels of 98 hierarchies. From the highest to the lowest category, it annotates allele groups (2-digit resolution), protein 99 sequence (4-digit resolution), exon sequence (6-digit resolution), and intron sequence (8-digit resolution). 100 For example, if two alleles encode an identical protein, they will have the same numbers for the first 2 101 levels (4-digit) of hierarchies. In practice, HLA typing is often carried out at either the protein or exon 102 level. Furthermore, the current gold standard, SBT, types just the exons that are responsible for encoding 103 the peptide binding domain (exons 2 and 3 for class I genes and exon 2 for class II genes). Using only the 104 subset of exons creates ambiguous alleles where two or more alleles share identical sequence over these 105 exons but differ in other regions. These ambiguous sequences are grouped as a 6-digit 'G' allele. Similarly, 106 4-digit 'P' grouping is used for the alleles that share same amino acid sequence over these exons. Our 107 method provides fully assembled sequence covering these exons and also outputs 6-digit 'G' resolution 108 typing result by selecting known alleles that have the smallest edit distance to the assembled sequences.

111
Overview of method 112 Our method takes an advantage of partial order graphs to capture all known alleles and further modifies the 113 graph to include variants found in the sequencing data for graph to include paths of true alleles. An overview 114 of our method is illustrated in Figure 2, and the major steps are labeled from (a) to (e). More details are 115 given in "Materials and Methods." We first create a comprehensive reference panel from a combined multiple 116 sequence alignment (MSA) of both full-length and exon-only known alleles for each HLA locus (step a).

117
Reads mapped to all known HLA loci in the human genome reference (GRCh38) are extracted (step b) and 118 aligned to the comprehensive reference panel (step c). Gene-wise partial-ordered graphs are constructed 119 using the combined MSAs and the alignments of the extracted reads are projected onto the graphs so that 120 each read alignment is stored as a path in the graphs and read depths on edges naturally become edge weights 121 (step d). When these read-or read-pair-backed paths connect 2 or more neighboring heterozygous sites of 122 2 alleles, they provide phasing information. During the alignment projection, the graphs are modified by 123 adding nodes and edges to incorporate differences found by alignment such as substitutions and indels. Note 124 that a sequence of an allele is encoded as a path through the entire graph. Finally, with the weighted graphs 125 with alignment paths, we formulate the problem of constructing the best pair of HLA allele sequences 126 as finding the pair of paths through the graph. When finding the pair, we consider consistent phasing 127 information from the reads and coverage with a use of base quality scores. Additionally, the pair of paths 128 may be identical to permit homozygous alleles. From the alignments, we extract paired-end reads aligned to all known HLA loci in chromosome 6, alter-137 nate sequences (ALT) of extended MHC (xMHC) regions, and HLA sequences (the complete set of coordi-138 nates used is in Table S1 in Supplementary Materials) included in the Human reference genome (hs38DH 139 packaged in BWA-kit v0.7.15). In the GRCh38 assembly, regions that exhibit sufficient variability are 140 represented in the primary chromosomal sequence as well as the ALT loci scaffolds.  Table 1.

145
The other methods compared here use earlier versions of the database because the content of database is 146 built into their software and there is no way to update or swap database at the user level. Using a later 147 version of the database does not give advantages as long as the earlier version also contains the true alleles 148 of testing individual.

149
Many alleles in the database only contain partial sequences, often just covering few exons responsible for 150 peptide binding domain of HLA genes (Table 1). For this reason, the IPD provides a set of pre-computed if the allele for the row has a corresponding row in M gene , intronic columns are inserted into M coding , 158 otherwise, intronic columns of the reference allele in M gene are inserted. In addition to the HLA genes that 159 are included in the IPD-IMGT/HLA database, non-polymorphic HLA genes DQA2 and DQB2, paralogous 160 copies of DQA1 and DQB1 and often regarded as poorly polymorphic, are added to the reference panel as 161 decoys to filter out reads originating from them aligning incorrectly to other class II genes. In our analysis, 162 we noticed that reads coming from DQA2 or DQB2 can make the assembly of typing exons of class II genes 163 difficult as previously reported [10].

165
In order to capture all information contained in M panel in a minimal manner as well as to allow flexibility 166 to enable novel sequence discovery, we use partial order graphs, a compact graphical representation for  Full-length denotes the total of number full-length alleles in the release and total number includes the full-length alleles and the alleles with only exon sequences reported.
vertices (e v i ,v i+1 ) exists if M panel has a row with consecutive bases b v i and b v i+1 at columns i and i + 1.

175
It is important to note that this graph contains at least the same number of paths as the number of rows in at column 2 to the vertex C at column 3 is the only modification needed for the graph to include the path 184 that encodes the novel allele. If a novel allele exists in data, there must be sequencing read that contain 185 the differences the novel allele has compared to known alleles. Assuming the sequence divergence is small 186 enough for pairwise alignment of the read and a known allele to capture the differences, we can obtain the 187 novel variants. For this reason, we further modify the HLA-graph to include additional paths that encode 188 for novel alleles in a test individual. We achieve this by modifying the previously constructed HLA-graph 189 by projecting the alignments of the reads likely coming from HLA region to known HLA genes. Given the HLA-graph with weights, assembling HLA alleles can be formulated as the problem of finding 231 two (diploid) paths (they can be identical) that explain the read mapping data (weights and phasing) best.

232
For example, the read depth value for an edge can be thought of as a capacity of the edge in classical flow 233 problems. When considering only the weights, we can find two paths where the sum of the flow values of the 234 paths are maximized. However, this formulation does not handle phasing information embedded by reads or 235 read pairs, therefore it can possibly select erroneous paths that are not consistent with phasing information.

236
For this reason, we want our objective to take both weights as well as phasing information into account.

237
Since read information is embedded on the HLA-graph, we can check if two neighboring variant sites can 238 be phased directly by a read or read pair. For example, given two heterozygous sites with A/T and G/C, 239 a read or a read pair carrying 'A' followed by 'G' at these sites indicates the chromosomal phase of 'AG' 240 since the sequencing read is assumed to come from a contiguous segment in a chromosome. In our method, 241 we first investigate variant regions individually to select locally phased paths with strong read support and 242 construct a set of full-length paths through the HLA-graph by connecting the locally phased paths that can 243 be further phased by read or read pair. Each of these full-length paths is considered as a candidate allele and 244 the best pair among the candidates with maximum read and phasing support is selected as the output. To 245 only consider nonzero-weight full-length paths, we remove all zero-weight edges and disconnected vertices 246 prior to finding paths.

247
HLA-graph to bubble graph. 248 We first focus on the parts of the HLA-graph where variations are captured, which are often referred to as 249 bubbles in sequence assembly graphs [20,21,22,23]. In the HLA-graph, we define a bubble as a region can easily be recognized in the HLA-graph because of its structure.

261
Finding the best set of paths in a bubble.

262
Ideally, we want to find exactly 2 paths per bubble since the ploidy number is 2 for humans. However, bubbles may contain more than 2 paths because of sequencing errors or misalignment. Therefore, we first identify all paths that are phased by a read or read pair. For each bubble, we can use a modified breadth first search (BFS) technique to obtain all paths that go through the bubble. In order to avoid enumerating over all paths through a bubble, we prune any path without a read backing the sequence encoded by the path at each iteration of BFS. For a path in the bubble to be retained, it must be supported by at least one read phasing the entire path. We can simply compute the set of phased reads for a path by taking a series of intersections of read sets maintained by each edge in the path. Each phased path through a bubble is a called a bubble path. Given multiple bubble paths from a bubble, our goal is to select the best pair of paths. We iterate over all possible pairs of bubble paths to calculate the posterior probability of each pair given all reads aligned to the bubble to find the pair that gives the maximum probability. We write the posterior probability of a given genotype as where G b is a genotype and D is the alignments of all reads aligned over the bubble. The genotype is a pair of bubble paths G b = (H b1 , H b2 ). Each d 2 D is an alignment string of a segment of a read and d i is is constant and we assume that the prior probability P (G b = (H b1 , H b2 )), is uniformly distributed over all genotypes. We can then compute the conditional probability P (D | G b ) by adopting widely used formulations [18,24] with small variations to allow multiple positions and base 'N' case that can be present from sequence data. We iterate over each read and compute P (D | G b ) as a product of the conditional probability of each read d. Since a read must come from one of the two chromosomes, and we assume that d is equally likely to come from either one of them, we can rewrite it as a sum of average of two cases where d is from H b1 and H b2 and The probability of seeing a base given an allele is defined as where ✏ of base symbol d i is the error probability obtained from the phred score of the base. For the case of d i ='N', we simply estimate the probability as 1/4. Instead of selecting H b from all possible |d|-mers, we limit to only the bubble paths found in the bubble and iterate over all pairs to select a pair of bubble paths P b that jointly gives the maximum probability: Phasing paths. 263 We now have an ordered list of bubbles, and a list of "best" read-backed phased bubble paths for each 264 bubble. The goal here is to find a set of candidate paths through all the bubbles by merging one bubble at a pairs as one path. We also update the phasing-read set for each merged path.

271
Selecting the best pair of candidate alleles.

272
Once the assembly by bubble merging is finished, we have a set of merged bubble paths through all bubbles.
By placing back the linear chains that were ignored during bubble merging to original positions (in between bubbles), we have a full-length candidate allele H i for each merged bubble path. Let C be the set of all candidate alleles and B be a set of all bubbles. Our goal is to select a pair of alleles (H 1 , H 2 ) 2 C ⇥ C that has the most consistent phasing support over all bubbles. We first define a scoring metric that checks for strength of phasing support jointly for a pair of allele H 1 and H 2 between a pair of consecutive bubbles b i and b i+1 and it is defined as Description of data used for evaluation 273 Simulated data.  profiles trained from Illumina re-sequencing of known samples. We used 2 x 100bp for the read length and 281 500 +/-50bp for the mean and the standard deviation of the insert size.  Accuracy is shown as a fractional number and the fraction of number of correctly typed alleles and total number of alleles tested is shown in parenthesis.

304
Simulation 305 In order to check that our method performs well, we tested our method on simulated data (see "Materials   The major benefit of our method is that it can assemble novel alleles across the typing exons, therefore its 329 typing capacity is not limited by known alleles as is the case with other database-matching methods. Unlike 330 the database-matching methods, Kourami uses the known alleles in the input database only to construct 331 the HLA-graph that serves as a template for reference-based assembly but does not discriminate between 332 the paths that encodes known alleles and novel alleles.

333
In order to demonstrate the ability to assemble novel alleles, we evaluated Kourami across various data 334 where ground truth is known. We tested on the simulated data and the real data with previously validated   matching methods such as PHLAT and HLA * PRG are not only unable to discover novel alleles but also faced 354 with a problem of selecting the best out of many alleles with equally similar sequences.  Accuracy is shown as a fractional number and the fraction of number of correctly typed alleles and total number of alleles tested is shown in parenthesis.
Trio consistency and inferred haplotypes.

366
The pedigree of Illumina platinum genomes include many third generation offspring and only the top right-367 hand trio in Figure 6 has previously validated HLA typing results. Since this trio includes the mother  We tested all 3 methods to determine whether predictions are trio-consistent across all trios (trio consis-373 tency shown in Table 4). Kourami and HLA * PRG agreed on all 204 alleles at 6-digit 'G' resolution and 374 the predicted alleles were trio-consistent and inferred haplotypes across HLA genes (intra-gene phased) are 375 shown in Figure 7. PHLAT's predictions were trio-consistent only for HLA-C and HLA-DQB1 when eval-376 uated at 4-digit 'P' resolution, and additionally for HLA-A when evaluated at 2-digit resolution. Although, 377 we do not know the true HLA types for the rest of 14 individuals, it is very likely that the predicted HLA 378 types are correct given that all typing results are consistent. Low trio-consistency ratios for PHLAT in Ta-379 ble 4 is mainly due to mistyped alleles in HLA-A and HLA-B in the NA12877 individual. Assuming the 380 predicted HLA types for the pedigree are correct, no recombination seems to have occurred, leaving no 381 disruption in ancestral haplotypes. In Figure 7, we labeled the haplotypes that are originating from the first 382 generation members as paternal-grand-father1/2 (PGF1, PGF2), paternal-grand-mother1/2 (PGM1, PGM2), We tested all three methods on this data set and the result is summarized in Table 5. PHLAT called 93 out 389 of 122 alleles correctly, resulting in 76% accuracy when evaluated at 4-digit 'P' resolution, and 89% when Kourami is able to assemble and type HLA alleles given WGS data in a fraction of the time compared 397 to the state-of-art methods such as PHLAT and HLA * PRG with a moderate use of memory. We compared 398 the CPU and memory usage using the WGS of NA12878 from Platinum Genomes data (2 x 101bp 55x).

399
All tests were run on the input of the reads aligning to xMHC region and unmapped reads. HLA * PRG We have shown that our HLA assembly method can accurately reconstruct both haplotypes that span the 409 typing exons of HLA genes by using a graph representation of known alleles as a guide, and the produced 410 haplotype sequences can be used to successfully carry out HLA typing given high coverage (> 30-fold) 411 paired-end WGS. WGS carried out for other analysis can be used to type individual's HLA types without 412 the use of another experiment (SBT and other molecular assays).

413
Most notably, the ability to discover rare and novel alleles is achieved by taking an advantage of the flexibil-414 ity of POG, combined with graph modification and it is instrumental in both research and clinical settings.

415
It is important to note that previously available computational methods using non-targeted sequencing data 416 cannot discover novel alleles because they are designed to find the best matching allele among the known 417 alleles. Especially with continuously decreasing cost of obtaining a personal genome, personal WGS will 418 only become more widely available, and our method can deliver accurate HLA typing without additional 419 experiments and cost. Also, Kourami is able to assemble and type at 6 digit 'G' resolution at a fraction of 420 the time compared to other methods with a moderate amount of memory usuage.

421
One limitation of our method is that it only supports WGS as it needs enough reads to cover both haplotypes 422 for each typing locus, and does not work on other NGS assays, such as WES or RNA-Seq. Since WES is 423 being used widely, as the cost for WES is lower compared to that of WGS, it is useful to be able to type 424 HLA genes using WES. However our testing (not shown) shows that it is difficult to accurately assemble 425 a full length sequence across the typing exons with WES because there are regions for which no reads are 426 sequenced. This may be due to biases that WES has been reported to have [30] as well as decrease in 427 effectiveness in detecting variants when using WES compared to WGS [30,31]. Additionaly, high-coverage 428 WGS is required to ensure accurate HLA assembly or typing. We randomly sampled coverages of 20x, 429 30x, and 40x from NA12878 data (Illumina Platinum Genomes) for 5 replicates and tested Kourami on 430 these samples. Assembly and typing stays accurate down to 30x coverage (accuracy of 0.97 across the HLA 431 genes) but at 20x coverage, the accuracy drops to 0.82 (Supplementary Table S6). This should not be a 432 surprise as haplotype-resolved assemblies of human genomes used ⇡100x coverage of NGS data [32,33].

433
Highly accurate results from our method signifies the recent advancement in handling genetic variation using 434 graph structures to encode variations found in multiple reference genomes [34,35,36,13]. Specifically   (e) Haplotype assembly of two alleles is obtained by finding two paths (drawn in red and blue;overlap in purple) through the graph.  For each operation, an alignment of read r to one of the known alleles H i is used to modify the graph. Each operation is applied to the POG and the resulting graph is shown. The nodes and edges that are newly added or changed during the operation is shown in red. The nodes that read path maps are shown as bold circles. For the case of the insertion into a new column, the newly assigned edge weights are explicitly drawn in using x and y variables. . HLA-graph to bubble graph An example of HLA graph with 3 bubbles (enclosed in dotted boxes) are shown (a) and its corresponding bubble graph is shown (b). Best paths through the bubbles can be thought of as a pair of distinct colored paths (shown in red and blue).