Defining the diverse spectrum of inversions, complex structural variation, and chromothripsis in the morbid human genome

Background Structural variation (SV) influences genome organization and contributes to human disease. However, the complete mutational spectrum of SV has not been routinely captured in disease association studies. Results We sequenced 689 participants with autism spectrum disorder (ASD) and other developmental abnormalities to construct a genome-wide map of large SV. Using long-insert jumping libraries at 105X mean physical coverage and linked-read whole-genome sequencing from 10X Genomics, we document seven major SV classes at ~5 kb SV resolution. Our results encompass 11,735 distinct large SV sites, 38.1% of which are novel and 16.8% of which are balanced or complex. We characterize 16 recurrent subclasses of complex SV (cxSV), revealing that: (1) cxSV are larger and rarer than canonical SV; (2) each genome harbors 14 large cxSV on average; (3) 84.4% of large cxSVs involve inversion; and (4) most large cxSV (93.8%) have not been delineated in previous studies. Rare SVs are more likely to disrupt coding and regulatory non-coding loci, particularly when truncating constrained and disease-associated genes. We also identify multiple cases of catastrophic chromosomal rearrangements known as chromoanagenesis, including somatic chromoanasynthesis, and extreme balanced germline chromothripsis events involving up to 65 breakpoints and 60.6 Mb across four chromosomes, further defining rare categories of extreme cxSV. Conclusions These data provide a foundational map of large SV in the morbid human genome and demonstrate a previously underappreciated abundance and diversity of cxSV that should be considered in genomic studies of human disease. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1158-6) contains supplementary material, which is available to authorized users.

. Example of a cxSV with unknown impact on gene function. A 127.9 kb complex paired-deletion inversion ("delINVdel") alters multiple loci and may lead to a complex fusion product, SLCO1C1-SLCO1B3, including the net loss of five exons and inversion of three exons. SLCO1C1 has been previously described in ASD due to a de novo truncating coding mutation in an ASD-affected subject [1], and one component of this cxSV, a flanking deletion, was previously identified by WES [2]. liWGS sequencing depth visualization was generated with CNView [3].

S-11
Supplemental . These data support the possibility of this second mutational event involving mitotic missegregation into a micronucleus and subsequent chromoanasynthesis, as has been previously proposed as a primary mechanism for chromoanagenesis [6][7][8][9], possibly attributable to instability in the presence of the large de novo dupINVdup arising within 2.4 Mb of the centromere.

S-16
Supplemental Figure 8. Computational framework for consensus classification of canonical CNV segments. We used a tiered pipeline to integrate physical depth and paired-end (PE) clustering SV signals from liWGS data, and to score groups of similar putative CNV calls for qualitative confidence in each call.
In brief: we considered PE clustering and physical depth calls both jointly where they overlapped as well as independently where they did not. After refining a set of all candidate CNV intervals, we applied a kmeans copy-state genotyping algorithm to affirm or refute each interval. Finally, we screened CNVs on overlap with a set of blacklist regions including poorly mapping sequences, N-masked reference bases, and known multicopy loci (e.g. segmental duplications/low-copy repeats). See Methods for a full description of this tiered integration process. Only CNVs scored as "high" or "medium" confidence were included for genome-wide analyses, while "low" confidence CNV calls were excluded from all analyses, counts, and statistics.

S-18
Supplemental Figure 10. A de novo inversion of the 2q23.1 syndrome locus in a subject with ASD. liWGS analysis in a subject with ASD discovered a de novo 675.2kb inversion of the 2q23.1/MBD5 microdeletion syndrome locus, which completely inverted the known driver gene of 2q23.1 syndrome, MBD5, yet had no observable functional effect on MBD5 expression in proband lymphoblasts (green barplots; exons 3-4 and 7-8 tested by RT-qPCR). Direct disruption of ACVR2A resulted in reduced expression (pink barplots; exons 2-3 and 4-5 tested by RT-qPCR). Both MBD5 and ACVR2A are constrained against truncating coding variation [17,18], and both genes also lie within the 2q23.1 microdeletion syndrome critical region [19][20][21]. This inversion provides evidence for a possible secondary contributing effect from ACVR2A in the NDD phenotypes associated with 2q23.1 syndrome, and further highlights the value of WGS for SV in studies of disease genomics.

S-19
Supplemental Results 1: Evaluation of SV discovery methods from six independent validation methods. We attempted validation of 33.8% (3,756/11,108) of all fully resolved SVs using five orthogonal approaches, including: (1) PCR and Sanger sequencing of breakpoints (n=114 SV tested), (2) comparison to chromosomal microarray (CMA) data available for 99.0% (679/686) of subjects (n=1,170 SV tested) [15,22], (3) multiplexed targeted breakpoint capture and deep resequencing in 96 subjects and their parents (n=230 SV tested), (4) linked-read WGS (n=3 SV tested), and (5) comparison to standard short-insert whole-genome sequencing (siWGS) data available on a subset of subjects (n=39 subjects; n=2,470 SV tested) [23]. From these experiments, 89.4% (3,356/3,756) of all assessed SVs were validated by at least one method (Figure 1C and Additional File 1). We also sequenced pairs of technical replicate liWGS libraries in 22 subjects to further appraise our methods. Based on these replicate library pairs, we found a median of 87.5% concordance between replicate libraries. We attempted to estimate a false negative rate by evaluating 32,472 low-quality candidate SV read clusters that were filtered as 'invalid' by our computational SV pipeline, finding that 5.9% (1,906/32,472) of computationally predicted 'invalid' read clusters successfully validated from at least one assay. Finally, we assessed the sensitivity of liWGS to capture high-confidence CNVs from CMA [15], observing that liWGS successfully detected 99.0% (1,642/1,659) and 98.5% (802/814) of all deletions and duplications identified by CMA, respectively. Based on these experiments, we estimated a global false discovery rate (FDR) of 10.6% and false negative rate (FNR) of 5.9% for our SV discovery methods. Importantly, it is likely that the reference-based short-read mapping methods used here (liWGS, capture sequencing, and siWGS) share similar error models, which could have underestimated FDR and FNR. Finally, the resolution of liWGS is restricted proportional to the size of the insert, and both sensitivity and specificity for SV detection is thus greatly reduced for rearrangements smaller than ~5 kb (Additional File 2: Supplemental Note 1).
Supplemental Results 2: Coding and noncoding functional annotations for all fully resolved SV. Upon overlaying gene annotations onto the SV detected in this study [24], 27.0% (2,995/11,108) of fully resolved SVs were predicted to likely perturb coding sequence of one or more genes, of which approximately one third (35.3%; 1,057/2,995) were predicted to involve two or more genes. Collectively, SVs impacted 4,067 distinct genes in this cohort, of which 73.1% (2,972/4,067) encoded proteins. From these predictions, we estimated each subject's genome to contain 181 genes altered by SV at the resolution of liWGS, of which half (50.8%; 92/181) were computationally predicted to have at least one normal message truncated, resulting in a loss of function (LoF), and 19.3% (35/181) were subject to whole-gene copy gain (CG) of one or more genes. The remaining predictions comprised a variety of other perturbations, including gene retrocopy, translocated exons, and insertions into gene bodies, the functional outcomes of which are less certain than LoF or CG.
We also considered the contribution of SV to noncoding regulatory elements of the genome. Such noncoding regulatory sites are not as well defined as coding loci, and are further confounded by transience and tissue-specificity [25,26]. These analyses predicted 35.2% (3,910/11,108) of all fully resolved SV to disrupt at least one candidate promoter or enhancer element derived from ENCODE (v3) histone acetylation marks [27]. Despite the frequent proximity of these noncoding regulatory annotations to genes (84.0% within ±20kb of a gene body), a majority (55.4%; 2,166/3,910) of all SV with predicted noncoding regulatory impact had no direct genic perturbation. Next, by mapping all fully resolved SVs onto annotated boundaries between topologically associated domains (TADs) derived from cultured human fibroblasts [28], we identified 192 SVs in this cohort predicted to disrupt one or more TAD boundaries, indicating that a subset of SVs cause restructuring of the local 3D genome architecture, which recent evidence suggests may have relevance to human disease [29][30][31][32][33][34].
Supplemental Results 4: Cryptic de novo SV that may contribute to ASD risk. Although this study was designed to classify the mutational spectrum of SV, the extensive validation studies performed herein did confirm several de novo SVs that had not been previously detected by chromosomal microarray (CMA) and whole-exome sequencing (WES) analyses on these same subjects [15]. Two particular examples merit discussion. The first is the gene VRK1, which was disrupted by a relatively small 14 kb de novo LoF exonic deletion in one subject with ASD in this cohort (Additional File 2: Supplemental Figure 9). VRK1 is involved in chromatin remodeling [13], shares coexpression patterns with known ASD-associated genes [10], has been implicated in neuronal migration, development, and maintenance [11,12], and was remarkably predicted to be among the most highly connected genes to other genes that are recurrently mutated in ASD [14]. Yet, as our colleagues previously noted, to date there has been "essentially no genetic evidence for [VRK1's] involvement in ASD" [14]. The second example is a 675 kb de novo inversion at the 2q23.1 reciprocal microdeletion/microduplication syndrome locus, which encompasses MBD5, the syndrome's proposed driver gene (Additional File 2: Supplemental Figure 10) [19][20][21]. This inversion did not directly disrupt MBD5, instead truncating a proximal gene, ACVR2A, and quantitative real-time PCR follow-up studies in available lymphoblastoid cell lines (LCLs) revealed reduced expression of ACVR2A, but not MBD5, commensurate with a heterozygous LoF mutation. Both MBD5 and ACVR2A are constrained against LoF mutations in healthy individuals and are within in the 2q23.1 syndrome critical region [17][18][19][20][21], suggesting this de novo inversion may provide an initial clue for a possible secondary contributing locus in the 2q23.1 microdeletion syndrome, although a more concrete interpretation is challenging due to the complex and as-of-yet unresolved pathogenicity of the 2q23.1 region. These results emphasize the value of capturing the complete spectrum of genomic variation in studies of human disease.

S-23
Supplemental Note 1. Definition and validation of cxSVs. In this study, we defined rearrangements as complex if the variant allele structure involved two or more distinct rearrangement signatures (e.g. inversion & deletion) or if the rearrangement involved at least three breakpoints detectable at the resolution of liWGS; this breakpoint resolution varies by coverage and by genomic locus, but is approximately 1.5kb in most cases. However, as has been highlighted by several studies [4,[44][45][46][47], SV breakpoints frequently do not form precise junctions on the derivative chromosomes, and instead involve micro-inversions, small nontemplated insertions, or other micro-complexity. As liWGS does not provide nucleotide-level resolution of breakpoints, we were unable to categorically assess any additional breakpoint complexity below ~1.5kb. A previous study sequenced the breakpoints of 38 cxSVs detected by liWGS, finding that 5/38 (13%) rearrangements had complexity at one or more breakpoints below the detection thresholds of liWGS [48]; despite small sample sizes, these prior data provide a rough estimate of the potential frequency of SV complexity not captured by liWGS in the present study. Thus, there likely is a meaningful subset of SV that are misclassified in this present study as a consequence of the relatively limited resolution of liWGS, and that the sixteen cxSV subclasses delineated here are almost certainly an underrepresentation of the total complement of cxSVs existent in the human genome.