STRetch: detecting and discovering pathogenic short tandem repeat expansions

Short tandem repeat (STR) expansions have been identified as the causal DNA mutation in dozens of Mendelian human diseases. Traditionally, pathogenic STR expansions could only be detected by single locus techniques, such as PCR and electrophoresis. The ability to genotype STRs directly from next-generation sequencing data has the potential to reduce both the time and cost to reaching diagnosis and to discovering new causal STR loci. Most existing tools detect STR variation within the read length, and so are unable to detect the majority of pathogenic expansions. Those tools that can detect large expansions are limited to a set of known disease loci. Here we address this by presenting STRetch, a new genome-wide method to detect pathogenic STR expansions at known and novel loci. We demonstrate the use of STRetch for detecting pathogenic STR expansions in short-read whole genome sequencing data by applying it to the analysis of 97 whole genomes to reveal variation at STR loci. We further demonstrate the application of STRetch to solve cases of patients with undiagnosed disease, where STR expansions are a likely cause. STRetch assesses expansions at all STR loci in the genome and has the potential to detect novel disease-causing STR loci. STRetch is open source software, available from github.com/Oshlack/STRetch.

Here we address this by presenting STRetch, a new genome-wide method to detect pathogenic STR expansions at known and novel loci. We demonstrate the use of STRetch for detecting pathogenic STR expansions in short-read whole genome sequencing data by applying it to the analysis of 97 whole genomes to reveal variation at STR loci. We further demonstrate the application of STRetch to solve cases of patients with undiagnosed disease, where STR expansions are a likely cause. STRetch assesses expansions at all STR loci in the genome and has the potential to detect novel disease-causing STR loci.
STRetch is open source software, available from github.com/Oshlack/STRetch.

Background
Short tandem repeats (STRs), also known as microsatellites, are short (1-6bp) DNA sequences repeated consecutively. Approximately 3% of the human genome consists of STRs [1]. These loci are prone to frequent mutations and high polymorphism, with mutation rates 10 to 100,000 times higher than average mutation rates in other parts of the genome [2]. Dozens of neurological and developmental disorders have been attributed to STR expansions [3]. STRs have also been associated with a range of functions such as DNA replication and repair, chromatin organization, and regulation of gene expression [2,4,5].

STR expansions have been identified as the causal DNA mutation in almost 30
Mendelian human diseases [6]. Many of these conditions affect the nervous system, for example Huntington's disease, spinocerebellar ataxias, spinobulbar muscular atrophy, Friedreich's ataxia, fragile X syndrome and polyalanine disorders [7]. Most tandem repeat expansion disorders show dominant inheritance, with disease mechanisms varying from expansion of a peptide repeat disrupting protein function or stability, to causing the aberrant regulation of gene expression [8].
STR expansion diseases typically show genetic anticipation, characterized by greater severity and earlier age of onset as the tandem repeat expands through the generations [9]. In many STR diseases, the probability that a given individual is affected increases with the repeat length. In some cases severity also depends on the gender of the parent who transmitted the repeat expansion [10]. The number and position of imperfect repeat units also influences the stability of the allele through generations [9]. Together these features can be used to identify patients with a disease of unknown genetic basis that might be caused by an STR expansion.
Traditionally, STRs have been genotyped using gel electrophoresis. Polymerase chain reaction (PCR) is performed using primers, which are complementary to unique sequences flanking the STR. The PCR product is then run on a capillary electrophoresis gel to determine its size. Although the method has been scaled to handle dozens of samples, it is still labor-intensive and costly. Each new STR locus to be genotyped requires the design and testing of a new set of PCR primers, along with control samples.
A number of diseases are known to be caused by any one of multiple variants, including STR expansions, single nucleotide variants (SNVs) or short indels. For example there are more than ten STR loci in as many genes known to cause ataxia [11], in addition to SNVs and indels in dozens of genes [12]. For such diseases, this can mean hundreds of dollars spent per STR locus, alongside SNV and short indel testing. For such conditions there is a clear need for a single genomic test that can detect all relevant disease variants including SNVs, indels and STRs.
The ability to genotype STRs directly from next-generation sequencing data has the potential to reduce both the time and cost to reaching diagnosis and to discover new causal STR loci. It is becoming increasingly common to sequence the genomes or exomes of patients with undiagnosed genetic disorders. Currently the analysis of this data is generally restricted to SNVs and short indels, with STRs generally only investigated in an ad-hoc manner if they are a common cause of the disease. The ability to detect STR expansions in next-generation sequencing data gives the potential to perform disease variant discovery in those patients for which no known pathogenic variants are found.
The vast majority of current STR genotyping tools for short-read sequencing data (most notably LobSTR [13], HipSTR [14] and RepeatSeq [15]) are designed to look at normal population variation by looking for insertions and deletions within reads that completely span the STR. These tools are limited to genotyping alleles that are less than the read length, with sufficient unique flanking sequence to allow them to be mapped correctly. However, for most STR loci causing Mendelian disease in humans, pathogenic alleles typically exceed 100bp, with pathogenic alleles at some loci in the range 1,000-10,000bp [16], far exceeding the sizes that can be detected using these algorithms.
One STR genotyper, STRViper [17], detects alleles exceeding the read length by looking for shifts in the insert size distribution. This method requires that the insert size distribution has a relatively low standard deviation and is limited to repeats smaller than the insert size. The mean insert size can be as low as 300-400bp, meaning that for a very large pathogenic repeat, there may be very few or no spanning read pairs. This strategy is therefore unsuitable to detect many of the STR expansions known to cause disease in humans. Another tool, ExpansionHunter [18], uses read pair information and recovery of mismapped reads to detect large STR expansions.
However this tool only works on specific loci as defined by the user, and so is not a genome-wide method. Similarly eXTRa [19] detects expansions in a set of 21 userdefined loci in the genome and requires a set of control samples in order to define the statistical probability of an expansion.
Long read sequencing technologies can potentially sequence through larger repeat loci [20] however, this sequencing is currently far too expensive for the clinic. Also, high error rates and low throughput in these technologies make them less accurate for genotyping SNVs and short indels and so a poor alternative to short-read sequencing in a clinical setting. Clearly there is still a great need to be able to detect pathogenic expansions from short read data.
Here we present STRetch, a new method to detect pathogenic STR expansions at every STR locus in the genome, and estimate their approximate size directly from short read sequencing. We show that STRetch can detect pathogenic STR expansions in short-read whole genome sequencing data. We also demonstrate the application of STRetch is open source software, available from github.com/Oshlack/STRetch.

STRetch detects large STR expansions
The STRetch method has been designed to identify expanded STRs from short-read sequencing data and give approximate sizes for these alleles. Briefly, the idea behind STRetch is to first construct a set of additional STR decoy chromosomes comprising the sequences of all STR repeat units. These are added to the reference genome. By mapping to this modified genome, STRetch identifies reads that originated from large STR expansions that now preferentially map to the STR decoys. These reads are then allocated back to the genome using read pair information, and the source locus is assessed for an expansion using a statistical test based on coverage of the STR. Figure   1 outlines the STRetch method, and the details are expanded below.

STR decoy chromosomes: generating an STR-aware reference genome
A key feature of STRetch is the generation of STR decoy chromosomes to produce a custom STR-aware reference genome. Most aligners have difficulty accurately mapping reads containing long STRs. For example, although the BWA MEM algorithm has superior performance for mapping reads containing STRs [21] sometimes reads containing long STRs map to other STR loci with the same repeat unit, or completely fail to map [18]. The systematic mis-mapping of STR reads is unsurprising considering that BWA MEM is optimized to find the longest exact match [22]. For a read made up primarily of STR sequence, the best match is likely to be the longest STR locus in the reference genome with the same repeat unit.
STRetch takes the issue of systematically mis-mapped, or unmapped STR reads and uses this as a way to identify reads that contain long STR sequence. To achieve this, we introduce the concept of STR decoy chromosomes. These chromosomes are sequences that consist of 2000bp of pure STR repeat units that can be added to any reference genome. STR decoy chromosomes for all possible STR repeat units in the range 1-6bp are generated, and filtered for redundancy, resulting in 501 new chromosomes added to the reference genome (STRetch provides hg19 with STR decoys, see methods). While reads from STRs similar to the allele in the reference genome will map to their original locus, reads containing large STR expansions will preferentially align to the STR decoy chromosomes. These reads can then be further examined for evidence of a pathogenic expansion.

Mapping to STR decoys to identify reads containing STRs
Once the new STR-decoy reference genome is created, the first step maps reads against the new reference genome using BWA MEM. If the data has already been mapped STRetch can optionally just extract and re-map a subset of reads likely to contain STRs. The reads that are extracted are those that aligned to known STR loci, and unmapped reads (see methods). Any reads mapping to the STR decoy chromosomes are inferred to have originated from an STR.

Determining the origin of STR reads
Next the reads that map to the decoys are assigned to genomic STR positions.
STRetch uses the mapping position of the mate to infer which STR locus each read originates from. Known STR loci are obtained from a Tandem Repeats Finder (TRF) [23] annotation of the reference genome. For a given read, if the mate maps within 500bp of a known STR locus with the same repeat unit, then the read is assigned to that genomic STR locus (or the closest matching locus if multiple loci are present).
After all possible reads are assigned, there may be a difference between the number of reads mapping to a given STR decoy chromosome and the number of reads assigned to all STR loci with that same repeat unit. Unassigned STR reads can occur for a variety of reasons, such as their mates map too far from a known locus, their mate also maps to the STR decoy chromosome, or their mate is unmapped. This number will increase in samples with very large and/or multiple STR expansions because more read pairs will originate from the STR. This will result in STRetch underestimating the size of very large alleles.

Detecting outlier STR loci
STRetch next uses a statistical test to identify loci where an individual has an unusually large STR. Specifically, STRetch compares the number of STR decoy reads assigned to each locus for a test sample with STR reads from a control set. At each locus the reads are normalized by the average coverage of the sample. A robust zscore ("outlier score") is then used to test if the log-normalized number of reads is an outlier compared with the controls (see methods). A variety of control samples can be used. Firstly, STRetch can be run on a set of controls and then the estimated median and variance parameters from these controls can be used for subsequent test samples.
Secondly, STRetch supplies median and variance parameters estimated from a reference set of PCR-free whole genomes that can be used (see "STRetch reveals STR expansions in 97 whole genomes"). Thirdly, a set of samples that are all being tested can be used and compared with each other. The advantage of the first and last options is that the sequencing is usually run at the same centre with the same library preparation protocols, however the data sets might be smaller. If the third option is used it may not detect outliers if the same expansion occurs in many of the samples, although the statistic is robust to up to approximately 50% outliers.

Estimating the size of STR alleles
STRetch works on the assumption that, for a given locus, the number of reads containing the STR repeat unit is proportional to the length of the repeat in the genome being sequenced. Therefore STRetch estimates the size of any detected expansion using the normalized counts allocated to that STR locus. Through simulation we found that the counts are linearly related to the length and we used the simulated results for estimating the allele size (see methods).
To estimate the size of the STR expansion, we performed a simulation with various STR lengths (see methods). Specifically we simulated reads from 100 individuals with the genotype 16xCAG/NxCAG at the SCA8 locus, where N was randomly selected in the range 0 to 500. Our simulated data exhibits a linear relationship between allele size and the number of reads mapping to the STR decoy chromosome.
This relationship is used to estimate the size of alleles (Supplementary Figure 2).

Output files
On completion, STRetch generates a tab-delimited output file that contains all STR loci for which STR-decoy reads were detected for each sample. Further information includes p-values for statistical significance of an expansion, details of the STR locus (position, repeat unit, size in reference), robust outlier z-score, locus read count and the allele length estimate. By default this file is sorted such that the most significant expansions are ranked at the top.
For analysis with STRetch, we first extracted reads overlapping all known STR loci annotated by Tandem Repeats Finder (see methods), then processed these reads through the STRetch pipeline, using the hg19 reference genome. The STRetch statistics were calculated twice, first using just these 10 samples as controls for each other ("internal control") and then using the 97 WGS described below as controls ("reference control"). We also ran LobSTR/HipSTR to estimate the size of the short allele in each case.
For six of the ten samples we had information about the disease and the estimated allele size by PCR. For the other four samples (Sample 7, Sample 8, Sample 9 and Sample 10) we were initially completely blinded to all patient information, including phenotype. Disease and allele size estimates were only revealed to us after we had correctly identified the causal STR expansion in each case. Table 1 summarises the   results of this analysis, while supplementary table 1 shows allele size estimates for these samples using STRetch, LobSTR, HipSTR and ExpansionHunter. For Sample 10 we were initially unable to identify a significant expansion in a known pathogenic gene. After the variant was revealed to be a large GAA expansion at the FRDA locus we investigated further and discovered that although the reference genome has 6xGAA at this position, the STR is missing from the TRF genome annotation. This locus is positioned in the middle of an Alu element. We hypothesis that TRF fails to annotate this STR due to its relatively small size and the flanking repetitive sequences. After manually adding this locus to the genome annotation and rerunning the analysis on the ten samples in table 1, STRetch was able to detect an expansion at this locus.
Most of the true positives that were detected had significant expansions when using both the 97 reference controls and the 10 internal controls. However there were two samples (Sample 3 and Sample 7) that were only significant when using the much larger set of reference controls.
STRetch underestimates the allele size for many of the true positives (figure 2). One likely explanation for this is that the current implementation of STRetch is limited by the insert size of the sequencing data. Some alleles will be so large (e.g. DM1 in Sample 9 is >450bp) that there will be read pairs where both are completely contained within the STR. So both will map to the STR decoy and will not be assigned to an STR locus. STRetch does not try to use these reads to estimate the allele length, and so will systematically underestimate large alleles. In comparison, ExpansionHunter more closely estimated alleles in most cases, with a tendency to overestimate allele sizes (supplementary table 1). As expected, both LobSTR and HipSTR dramatically underestimated large alleles, or did not make a call in many cases.   These likely non-pathogenic variants at known pathogenic loci highlight the ability of STRetch to detect STR expansions at the genome-wide scale. This gives us the power to explore population variation at these loci, and so to better determine the true nonpathogenic range of these known pathogenic loci.

Genetic diagnosis and validation of an ataxia patient
As noted above, six ataxia patients (and two unaffected parents) were included in our 97 WGS cohort. These patients had previously been tested with a panel of the five most common ataxia STR expansions: SCA1, SCA2, SCA3, SCA6 and SCA7. The whole genome data had also been examined for causative SNVs and indels in candidate ataxia genes using the GATK Best Practices recommendations [24], and copy number variations using BreakDancer [25] and Genome STRiP [26] without success in diagnosis.
In one of these patients STRetch identified a highly significant expansion of SCA8, a known but rare disease locus in ATXN8 with an outlier z-score of 11.11 (p=3.55e-24) and an estimated allele size of 54.4x.
As a result this SCA8 expansion was validated using a PCR assay, which confirmed a pathogenic CAG expansion with an allele size of 97x. In addition, an affected sibling, not sequenced using WGS, was tested for an expansion at the same locus and was similarly determined to have a pathogenic allele of ~96x.
We ran LobSTR and HipSTR on the WGS data of this same patient. At the expanded pathogenic locus LobSTR called a homozygous 6xTGC insertion, while HipSTR called a homozygous indel deleting one TAC repeat unit upstream of SCA8 and an insertion of seven TGC repeat units for a net six repeat unit insertion. Both tools report a reference allele size of +15x, so the total allele size is 21x in both cases. PCR analysis of the proband and sibling indicated the expansions were heterozygous both with a short allele size of ~29x, however these sizes may not be directly comparable due to potential variation in the definition of the reference locus size. We configured ExpansionHunter to estimate the allele size for the SCA8 locus detected using STRetch and a 96x allele was estimated (confidence interval 76-113).
The LobSTR and HipSTR analysis reveals the danger of relying solely on within-read signals of STR expansion, as this will miss any insertions larger than the read length.
This work demonstrates the ability of STRetch to discover pathogenic STR expansions from WGS data.

Discussion
We have demonstrated the ability of STRetch to detect pathogenic STR expansions in short-read WGS data. STRetch was able to detect most of the known true positive variants where the variant size exceeded the read-length and only missed one out of nine true positives. We further applied STRetch to 97 WGS samples to detect both potentially pathogenic STR expansions and expansions of moderate length in STR disease loci, where the allele size is likely below the threshold for disease.
Importantly this set of analysed genomes can act as a control set, and statistical parameters for the STR loci are provided to use in testing for expansions with STRetch. Within this cohort STRetch revealed a previously undetected pathogenic STR expansion in SCA8 that was validated by PCR in the proband and an affected sibling.
The main limitation of STRetch is its tendency to underestimate the allele size of STR expansions, especially variants larger than the insert size. This is because STRetch only assigns STR reads to a genomic locus when there is a uniquely mapping readpair. Future implementations of STRetch could use reads where both reads in the pair map to the STR decoy chromosome to improve the allele size estimates. It should be noted, however, that the statistical approach for testing for outliers against a set of controls is not as severely affected by this limitation.
As expected, we found that STR genotypers such as LobSTR and HipSTR, that are designed to genotype short STR variation, were unable to detect large pathogenic variants in WGS data. These tools instead called a homozygous genotype, corresponding to the size of the non-expanded allele or called a heterozygous with slight variation in the small allele, or failed to make a call. Using these tools in conjunction with STRetch allows the estimation of the short allele in cases where STRetch has detected an expansion, allowing us to obtain a fuller picture of the genotype from short read sequencing data.
Finally, ExpansionHunter performed relatively well at estimating allele sizes, although tending to overestimate them, where STRetch tended to underestimate.
However a key limitation is that ExpansionHunter does not have a statistical basis for calling expansions and is not currently configured as a genome-wide tool. Each locus of interest must be defined in a separate configuration file.

Conclusions
Here we have introduced STRetch, a method to test for STR expansions using whole genome sequencing data. We have shown that STRetch can detect pathogenic STR expansions relevant to Mendelian disease. Although the emphasis has been on STRs known to cause Mendelian disease, a key advantage of STRetch over other methods is that it is a genome-wide approach. STRetch performs statistical tests for expansions at every STR locus annotated in the reference genome, and so has potential to be used for not only diagnostic applications, but also in research to discover new pathogenic STR expansions.

The STRetch pipeline
The STRetch pipeline is implemented using the Bpipe [27] pipeline framework (v0.9.9.3). This allows for a pipeline combining standard bioinformatics tools with novel scripts, and is compatible with many high-performance computing environments, allowing large-scale parallelization over multiple samples.
To summarize the pipeline and components in brief: Reads are mapped to the reference genome with STR decoy chromosomes using

Extracting likely STR reads pairs from aligned BAMs
In the case where reads have already been mapped to a reference genome, STRetch provides the option of extracting likely STR reads pairs from the BAM file for analysis, rather than remapping all reads in the sample.
In this case STRetch defines a region where STR reads are likely to align by taking the Tandem Repeats Finder [23] annotation of the reference genome and expanding the region to include 800bp of flanking sequence on each side. Reads aligned to this region, and all unmapped reads are extracted using samtools view. These are sorted to place together read pairs using samtools collate and then are extracted in fastq format using bedtools bamtofastq. Unpaired reads are discarded.

Aligning and allocating reads to STR loci
Stretch uses BWA MEM to align reads to the custom reference genome. Any read mapping to the STR decoy chromosomes is presumed to have originated from an STR locus.
To determine which STR locus the reads originated from, the mates of the reads mapping to a given STR decoy chromosome are collected. If the mate maps within 500bp of a known STR locus with the same repeat unit, it is assigned to that locus. If multiple loci fall in this range, it is assigned to the closest. This distance was chosen because the average insert size of WGS data was 500bp, so we expect the mate to map within 500bp or less of the STR locus. This value can be configured in the pipeline if required.
To correct for library size (total number of aligned reads) the counts for each STR locus are normalised against the median coverage for that sample within the target region (in the case of exome capture), or across the entire genome. Counts are normalised to a median coverage of 100x: log 2 (100*(raw counts + 1)/median sample coverage)). Log 2 normalized counts are used in subsequent statistical analyses.

Detecting outliers
To detect individuals with unusually large STRs, STRetch calculates an "outlier score" for each individual at each locus. The outlier score is a z-score calculated using robust estimates, with a positive score indicating the STR is larger than the median.
A robust z-score and p-value is calculated for each locus l using the normalized log counts. First the median and variance across all samples for locus l is estimated using Huber's M-estimator [30,32]. This calculation can be performed over all samples in a batch, or a set of control samples (estimates from the set of 97 PCR-free whole genomes are provided with STRetch). We test the null hypothesis that the log counts, for multiple testing across the loci using the Benjamini-Hochberg method [33]. A locus is called significant if the adjusted p-value is < 0.05.

Estimating allele sizes
We reasoned that the size of an expanded allele would be proportional to the number of reads containing STR sequence and hence the number of reads allocated to the STR locus. In order to estimate allele sizes we performed simulations of a single locus at a range of allele sizes. Specifically, reads were simulated from the SCA8 locus in ATXN8. One allele was held constant at 16xAGC and then we simulated repeat lengths in the other allele in the range 0-500 repeat units (selected at random from a uniform distribution). Alternate versions of the hg19 reference genome with these alleles were produced using GATK v3.6 FastaAlternateReferenceMaker. Reads were simulated from 10,000bp either side of the SCA8 locus using ART MountRainier-2016-06-05 [34]. Reads were 150bp paired end, with insert sizes sampled from a normal distribution (mean 500bp, sd 50bp) and 30X coverage (proportional coverage sampled from each haplotype). The Illumina error profile was used. Simulation code is available at github.com/hdashnow/STR-pipelines.
A plot of the number of reads mapping to the AGC decoy chromosome against the number of AGC repeat units inserted into the ATXN8 locus shows a clear linear relationship between these two variables (Supplementary Figure 2).
We can use this information from the simulated data to provide a point estimate of the allele size of any new sample we analyse with STRetch in the following manner. We fit a linear regression between the number of reads mapping to the STR decoy and the size of the allele from the simulated data (both log 2 transformed), in order to obtain estimates of the intercept and slope parameters, Here y is log 2 (coverage) and x is log 2 (allele size). Given a new data point from a real sample, the log 2 (coverage) for an STR locus of interest, the point estimate of the allele size is thus where allele size is the number of base pairs inserted relative to the reference and coverage is the normalised number of reads allocated to the locus.

Reference data
Reference genome: ucsc.hg19.fasta, with STR decoy chromosomes added as described above.
STR positions in genome annotated bed file: hg19.simpleRepeat.txt.gz. Source: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ Known STR loci are obtained by performing a Tandem Repeats Finder (TRF) [23] annotation of the reference genome. Pre-computed annotations of many genomes are available from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) [35]. TRF annotations are converted to bed files annotated with two additional columns: the repeat unit/motif and the number of repeat units in the reference.

Running other STR genotypers
To estimate the size of the shorter allele, LobSTR and HipSTR were run on BAM files containing the locus of interest and 1000bp of flanking sequence on either side. We