Robust detection of tandem repeat expansions from long DNA reads

Tandemly repeated sequences are highly mutable and variable features of genomes. Tandem repeat expansions are responsible for a growing list of human diseases, even though it is hard to determine tandem repeat sequences with current DNA sequencing technology. Recent long-read technologies are promising, because the DNA reads are often longer than the repetitive regions, but are hampered by high error rates. Here, we report robust detection of human repeat expansions from careful alignments of long (PacBio and nanopore) reads to a reference genome. Our method (tandem-genotypes) is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we can prioritize pathological expansions within the top 10 out of 700000 tandem repeats in the genome. This may help to elucidate the many genetic diseases whose causes remain unknown.


Introduction
A tandem repeat is a region where multiple adjacent copies of sequence reside in the genomic DNA. These regions are highly variable among individuals due to replication error during cell division. They are a source of phenotypic variability in disease and health. More than 30 human diseases are caused by copy-number alterations in tandem repeats 1 .
The range of pathogenic copy number change relative to the reference varies from a few to several thousand, and the length of repeating unit varies from e.g. three (triplet-repeat disease) to several thousand (macro-satellite repeat). As might be expected from such diverse underlying genetic causes, disease mechanisms are also variable. Well-known examples of triplet-repeat expansion diseases in protein-coding regions are polyglutamine diseases (e.g. spinal and bulbar muscular atrophy, Huntington's disease) 2,3 . Triplet-repeat expansion of CAG or CAA codons, which encode glutamine, leads to toxic protein aggregation and neuronal cell death. Another example of triplet-repeat disease is caused by CUG repeat expansion in the 3'UTR of the transcript from the DMPK gene, producing a toxic gain-of-function transcript which sequesters splicing factor proteins and causes aberrant splicing, resulting in multiple symptoms 4 . Not only gain-of-function mutations, but also loss-of-function repeat change in the promoter region due to transcriptional silencing has been reported (e.g. fragile X syndrome) 5 . In addition to short tandem repeat diseases, repeat copy-number aberration in human disease is also reported in a macro-satellite repeat (D4Z4). Shortening of the D4Z4 repeat causes aberrant expression of the flanking gene DUX4, which has a toxic effect in muscle cells 6 . The thresholds of pathogenic repeat expansion in coding regions are usually less than 100 copies and sometimes even a few copy differences can cause disease (e.g. oculopharyngeal muscular dystrophy) 7 . In contrast, some disease causing tandem-repeat expansions in introns or UTRs can be very long (e.g. DMPK) 4 .
It has been roughly a decade since the introduction of high throughput short read sequencers to clinical genetics. There have been numerous successful identifications of small nucleotide changes, especially in coding regions, mainly thanks to targeted sequencing (e.g. whole exome sequencing).
However, the diagnostic rate remains 30% (depending on the diagnostic platform used) 11 , leaving a large population of Mendelian diseases unsolved.
There may be many reasons, but one of the simplest is that the remaining patients may have mutations in "non-coding regions", or they may have mutations in coding regions which were overlooked due to the limitations of short read sequencing. One candidate is tandem repeat regions, which are difficult to analyze genome-wide by conventional techniques. Identification of disease-causing tandem repeat changes is usually realized by classical genetic technologies (i.e. linkage analysis, Southern blot, etc.) and targeted repeat region analysis in a large number of families.
The recent advancement of long read sequencing technologies may provide a good solution, because long enough reads can encompass whole expanded repeats, and can be analyzed using the flanking unique sequences.
Long read sequencers such as PacBio or nanopore sequencers have only very recently begun to enter clinical genetics. As of 2018, these technologies are continually improving in terms of accuracy and data output. However, their use in the clinical laboratory is still difficult, due to cost and the computing burden of large datasets. It would be preferable, and realistic, if low coverage data (~10X) could be used to detect alterations of tandem repeats.
We are aware of two previous methods for determining tandem repeat copy number from long DNA reads: PacmonSTR and RepeatHMM 12-13 . These methods: align the reads to a reference genome, then get the reads that cover a tandem repeat region of the reference, and perform sophisticated probability-based comparisons of these reads to the sequence of the repeating unit. In this study, however, we find that these methods do not always succeed with current long-read data.
We have recently advocated a method (using the LAST software) for aligning DNA reads to a genome allowing for rearrangements and duplications 14 .
This method has two key features. First: it determines the rates of insertion, deletion, and each kind of substitution in the data, and uses these rates to determine the most probable alignments 15 . Second: it finds the most-probable division of each read into (one or more) parts together with the most probable alignment of each part. This method found diverse types of rearrangement, the most common of which was tandem multiplication (e.g. heptuplication), often of tandem-repeat regions 14 .
Here, we detect tandem repeat copy number changes by aligning long DNA reads to a reference genome with LAST, and analyzing these alignments in a crude-but-effective way. We point out several practical difficulties with analyzing tandem repeat sequences, which motivate our crude analysis method.
Our approach is capable of analyzing tandem-repeats genome-wide even with relatively low coverage sequencing data. We believe that this tool will be very useful for identifying disease-causing mutations in tandem repeat regions which have been overlooked by short read sequencing in human disease.

tandem-genotypes method
tandem-genotypes (https://github.com/mcfrith/tandem-genotypes) has two required inputs: annotations of tandem repeats in a reference genome, and alignments of DNA reads to the same genome.
The annotations supply a start and end coordinate for each repeat, and the length u of its repeating unit. The repeat length need not be an integer multiple of the unit length. We define two ad hoc distances: "far" f = max [100, u] and "near" n = max [60, u]. (Actually, we truncate f at the edge of the sequence: where we speak of f, we really mean min[f, distance to the edge of the reference sequence].) last-split finds a division of each DNA read into (one or more) parts, and an alignment of each part to the genome. It gives each alignment a "mismap probability", which is high if that part of the read aligns almost equally well to other loci 14 . We regard one read's alignments as ordered by their 5' to 3' positions in the read.
2. Discard alignments of mostly-lowercase sequence. This removes alignments that consist almost entirely of simple sequence (such as atatatatata), which are less reliable. Simple sequence is detected and lowercased by lastdb and lastal, using tantan 16 . An alignment is discarded if it lacks any segment with score >= lastal's score threshold, when "gentle masking" is applied 17 .
3. Join consecutive alignments that are colinear on the same strand of the same chromosome, and separated by <= 10^6 bp.
4. Find all alignments that overlap a given repeat.
If there is one such alignment:
• Require that it extends beyond both sides of the repeat by at least f, else give up (i.e. don't use this DNA read for this repeat).
If there is more than one such alignment:
• Require that they are consecutive in the read, and on the same strand.
• Define the "left" alignment to be the first one if the strand is "+", else the last one. Define the "right" alignment in the opposite way.
• Require that the left alignment extends leftwards of the repeat, and leftwards of the other alignments, by at least f. Require that the other alignments do not extend leftwards of the repeat by f or more.
• Likewise for the right alignment.
• Define the insertion size as: the number of query bases, minus the number of reference bases (which could be negative), between the end of the left alignment and the start of the right alignment.
• Find the nearest multiple of u to this insertion size (as above).
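The distance definitions and the final rounding step can be illustrated with a minimal Python sketch. This is not the tandem-genotypes source code: the function names are hypothetical, and the sketch assumes that the copy-number change is the nearest multiple of u expressed in units of u, with exact ties resolved by Python's built-in round() (tie handling is not specified here).

```python
def far_distance(unit_length, distance_to_edge=None):
    """'Far' distance f = max(100, u), truncated at the edge of the reference sequence."""
    f = max(100, unit_length)
    if distance_to_edge is not None:
        f = min(f, distance_to_edge)
    return f


def near_distance(unit_length):
    """'Near' distance n = max(60, u)."""
    return max(60, unit_length)


def insertion_size(query_bases, reference_bases):
    """Query bases minus reference bases between the left and right alignments.

    The result is negative when the read is shorter than the reference here.
    """
    return query_bases - reference_bases


def copy_number_change(insertion, unit_length):
    """Nearest multiple of u to the insertion size, expressed in repeat units.

    Assumption: exact ties are rounded by Python's round() (round-half-to-even).
    """
    return round(insertion / unit_length)
```

For example, under these assumptions a read with 350 query bases against 50 reference bases across a triplet repeat gives an insertion size of 300 and a copy-number change of +100.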

Prioritization of copy number changes
The repeats are ranked by a priority score. Each repeat has multiple predictions of copy-number change, one per DNA read. If the average number of predictions (for repeats with at least 1 prediction) is >= 3, ignore the most extreme expansion and contraction per repeat. For each repeat, take the most extreme remaining change, and calculate: (length increase in bases) / (reference repeat length + 30).
(The + 30 prevents excessive scores for short reference repeats.) This score is multiplied by an ad hoc value per gene annotation, currently: 50 for coding, 20 for UTR, 15 for promoter, 15 for exon of non-protein-coding RNA, and 5 for intron. This is multiplied by 2 for coding annotations where the repeating unit codes polyglutamine or polyalanine in any reading frame (out of 6). The repeats are ranked by absolute value of this priority score.
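The scoring rule can be sketched in Python as follows. This is an illustrative reading of the description, not the tandem-genotypes implementation. Several details are assumptions: the function and dictionary names are hypothetical, "the most extreme expansion and contraction" is read as dropping the largest positive and the most negative change when they exist, and repeats without a gene annotation are given a weight of 1.

```python
# Ad hoc weights per gene annotation, as listed in the text.
ANNOTATION_WEIGHT = {
    "coding": 50,
    "UTR": 20,
    "promoter": 15,
    "ncRNA exon": 15,
    "intron": 5,
}


def should_trim(predictions_per_repeat):
    """True if repeats with >= 1 prediction have on average >= 3 predictions."""
    counts = [len(p) for p in predictions_per_repeat if len(p) >= 1]
    return bool(counts) and sum(counts) / len(counts) >= 3


def trim_extremes(changes):
    """Drop the largest expansion and the largest contraction, if present (assumed reading)."""
    kept = sorted(changes)
    if kept and kept[-1] > 0:
        kept.pop()       # most extreme expansion
    if kept and kept[0] < 0:
        kept.pop(0)      # most extreme contraction
    return kept


def priority_score(changes_in_bases, ref_repeat_length, annotation=None,
                   codes_polyq_or_polya=False, trim=True):
    """Score = most extreme remaining change / (reference length + 30) * weights."""
    kept = trim_extremes(changes_in_bases) if trim else list(changes_in_bases)
    if not kept:
        return 0.0
    extreme = max(kept, key=abs)
    score = extreme / (ref_repeat_length + 30)
    # Assumption: unannotated repeats get weight 1.
    score *= ANNOTATION_WEIGHT.get(annotation, 1)
    if annotation == "coding" and codes_polyq_or_polya:
        score *= 2
    return score
```

Repeats would then be sorted by the absolute value of this score, largest first.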

Generating tandem-repeat containing plasmids
Plasmids containing various numbers of CAG, GGGGCC and CAA repeats used for this study were generated as described elsewhere [19][20][21] and are available upon request (Supplemental Table 1). Sequence data of the plasmids were deposited (accession DRA007012, Table 2).

Comparison to RepeatHMM and PacmonSTR
The SCA10 and BAFME reads were analyzed as follows.

Running tandem-genotypes
The copy number changes of tandem repeats were determined by tandem-genotypes v.1.1.0 with some different options.

Tandem repeat changes in a chimpanzee, relative to the reference human genome, were found like this:

  tandem-genotypes -g refFlat.txt rmsk.txt hg38-panTro5-1.maf

These alignments (from https://github.com/mcfrith/last-genome-alignments) 14 are of an assembled chimp genome, not long reads: our methods work in this case too.

Chimeric reads of plasmid-derived repeats and human-derived flanks
Human nanopore reads covering repeat expansion disease loci were extracted from whole genome sequence data (rel3). The nanopore sequences flanking the repeat were excised using an in-house script. Randomly selected expanded and non-expanded repeat sequences were excised from plasmid nanopore sequences, using maf-cut. Then, these expanded and non-expanded repeats were inserted between the flanks, to generate chimeric reads. The combinations of repeat copy and number of rel3 reads are shown in Table 1, imitating the diploid genome. The chimeric reads were aligned to GRCh38 as mentioned above with WindowMasker 23 .
Dot-plot pictures were made with last-dotplot using the following command;

PCR amplification of inexact repeats in rel3 and PacBio data
Two inexact tandem-repeats were tested by PCR and Sanger sequencing.
Primers are described in Supplemental Table 2. PCR amplification was done using KAPA HiFi HS DNA polymerase (Kapa Biosystems, Basel, Switzerland).
PCR products were cloned into pCR-Blunt vector (Thermo Fisher Scientific, MA, USA) and subjected to Sanger sequencing.

Ethics
The Institutional Review Board of Yokohama City University of Medicine approved the experimental protocols. Informed consent was obtained from the patient, in accordance with Japanese regulatory requirements.

Nanopore sequencing of tandem-repeat containing plasmids
We tested plasmids with four different kinds of repeat (CAG, CAA, GGGGCC and iCCTG) that are known to cause human diseases. The CAG inspection of alignment dotplots (not shown) suggests that these are correct, and the copy numbers in these plasmids may not be completely stable.
In one case, pBS-(CAG)30, tandem-genotypes fails with almost no predictions. pBS-(CAG)30 was digested at an enzyme site very near the repeat region (10-bp upstream), so there is only 10 bp of non-repeat sequence upstream of the repeat, which is too short for step 2 of tandem-genotypes.
Thus, we digested the same plasmid with a different enzyme, DraIII, far from the start of the repeat. As expected, the tandem-genotypes prediction agrees with the actual copy number change (Supplemental Figure 1a, red arrows).
The GGGGCC repeats have bimodal copy number predictions, where the two modes correspond to reads from each DNA strand (Fig 1h-i, Supplemental Fig 1b). This can be explained by sequence-specific sequencing error tendencies 24 that differ for the forward- and reverse-strand repeat sequences.
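Separating the predictions by strand makes such a bias easy to see. The following Python sketch tallies copy-number changes per strand and compares their medians. It is a hedged illustration only: the function names are hypothetical, the input format of (strand, change) pairs is an assumption, and the toy values are made up rather than taken from our data.

```python
from collections import Counter
from statistics import median


def histogram_by_strand(predictions):
    """Count copy-number changes separately for '+' and '-' strand reads."""
    hist = {"+": Counter(), "-": Counter()}
    for strand, change in predictions:
        hist[strand][change] += 1
    return hist


def median_by_strand(predictions):
    """Median copy-number change per strand (strands with no reads are omitted)."""
    by_strand = {"+": [], "-": []}
    for strand, change in predictions:
        by_strand[strand].append(change)
    return {s: median(v) for s, v in by_strand.items() if v}


# Made-up toy predictions (strand, copy-number change), for illustration only.
toy = [("+", 30), ("+", 31), ("+", 29), ("-", 22), ("-", 21), ("-", 23)]
medians = median_by_strand(toy)
```

A large gap between the two medians suggests a strand-specific sequencing bias rather than two alleles, since both alleles should be sampled from both strands.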
In human genomic sequencing, it might be difficult to obtain deep coverage such as 1000X as we have done for these plasmids, due to cost. For some repeat expansion diseases, such as polyglutamine disease, disease-causing copy number change thresholds from the controls are usually less than 100. To test the ability of tandem-genotypes to distinguish fine copy number differences of these plasmids even with low coverage, we randomly picked 50, 30 and 15 reads from each dataset and compared the copy number predictions. Even with low coverage (15X) it is not difficult to distinguish copy numbers 18, 30, 70 and 130 for CAG repeats; 15 and 109 for CAA repeats; 21 and 52 for GGGGCC repeats (Figure 2).
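Random down-sampling of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure we used: the function name and the fixed seed are assumptions, added so that the example is reproducible.

```python
import random


def subsample_reads(read_ids, n, seed=0):
    """Pick n reads uniformly at random, without replacement.

    The seed is an assumption made for reproducibility of this sketch.
    """
    rng = random.Random(seed)
    return rng.sample(list(read_ids), n)


# Hypothetical read names, standing in for a deep plasmid dataset.
all_reads = [f"read{i}" for i in range(1000)]
subsets = {n: subsample_reads(all_reads, n, seed=1) for n in (50, 30, 15)}
```

Each subset can then be passed through the same alignment and copy-number prediction steps as the full dataset.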

Analyzing chimeric human/plasmid nanopore reads
We performed further tests on semi-artificial data. We obtained human nanopore reads that cover 10 disease-associated repeat regions, and replaced the repeat region in each read with the repeat region of a plasmid nanopore read.
We used plasmid repeat regions with disease-causing and healthy repeat copy numbers in 1:1 ratio (Table 1). These chimeric reads were aligned to a reference human genome, and copy-number changes were predicted. For each repeat, the predictions have clear bimodal distributions (Figure 3). As these plasmid-origin repeats were expected to vary because of nanopore base-call error or replication error in Escherichia coli during plasmid preparation, we also made chimeric sequences by inserting the exact number of exact repeats, to test the accuracy of tandem-genotypes. The results were improved in all cases (Figure 3), in particular the C9orf72 case with large strand-bias, indicating that deviations from the projected copy numbers are due to nanopore sequencing errors or replication error, and not systematic errors of tandem-genotypes.

PacBio sequencing datasets of patients with SCA10
Next, we examined four PacBio sequencing datasets of cloned PCR amplification products from the SCA10 disease locus (spinocerebellar ataxia 10, MIM 603516). SCA10 is caused by ATTCT expansion in the intron of ATXN10.
These datasets are from three unrelated patients: A, B and C 9 . Patient C has two datasets, C-1 and C-2, which are different clones sequenced with different  Figure 1b 9 , the purified clone they sequenced by PacBio had > 6 kb insertion, making the actual repeat size > 4 kb, closer to our prediction.
The same datasets were also analyzed with RepeatHMM and PacmonSTR. We first ran RepeatHMM with straightforward parameters: for subject A it found a similar peak to us but also an unexpected peak around zero, and it did not find the expected peaks for the other three datasets (Figure 4b).
We then consulted the RepeatHMM authors, who suggested non-obvious parameters that improved the C-2 result, but there was still a peak around zero (Figure 4c). PacmonSTR did not detect the expected repeat numbers (data not shown).

PCR and Sanger sequencing of non-exact expanded repeats in NA12878
Repeat annotations (i.e. rmsk from UCSC) include non-exact tandem-repeats. Non-exact or interrupted tandem-repeats sometimes cause human disease 8 . We detected inexact repeat expansions in the NA12878 human genome, by applying tandem-genotypes to the PacBio and MinION datasets.
tandem-genotypes found two peaks for this repeat, indicating a heterozygous ~300 bp insertion (Figure 5a). PCR amplification of this region showed two different products possibly from different alleles. One had the same length as the reference sequence (PCDH15-intron-repeat-S) (Figure 5a). Sanger sequence of the other longer PCR product (PCDH15-intron-repeat-L) revealed a ~332 bp insertion. Surprisingly, this insertion was not a tandem expansion, but rather an AluYb8 SINE (according to RepeatMasker).
We also examined an intergenic GT tandem repeat (chr8:48173947-48174212). tandem-genotypes found one peak indicating an insertion of ~1000 bp (Figure 5b). PCR of this region showed a single product from this tandem repeat region, estimated to contain a ~1000 bp insertion.
Sanger sequencing revealed that this expansion includes not only GT but also some unknown sequence. The expanded sequence is present in the chimpanzee genome (Figure 5b), indicating that this is actually a deletion in the human reference genome (which may have occurred by recombination between GT tandem repeats).
These two examples indicate that tandem-genotypes can also find complex and interrupted expansions (or non-deletions) of tandem repeats.

PacBio sequencing of a patient with BAFME
We further analyzed PacBio whole genome sequencing of a patient with a phenotype of benign adult familial myoclonic epilepsy (BAFME). In a large number of other BAFME patients in Japan 10 , the cause was attributed to large expansions of intronic TTTCA and TTTTA repeats in SAMD12. We wished to know whether our patient has such an expansion in SAMD12. We sequenced this patient's genomic DNA using a PacBio Sequel sequencer.
tandem-genotypes detected a ~5 kb insertion in three reads at the BAFME locus, where the coverage is 6X (Figure 6a). We also applied RepeatHMM and PacmonSTR to this dataset, but they failed to predict any expansions.

Some difficulties with tandem repeat analysis
The three expanded BAFME reads do not align to the repetitive region as would be expected for a straightforward repeat expansion (Figure 6b). Read 2, from the forward genome strand, does not align to the repetitive region at all, because its expanded region consists mostly of TCCCC repeats whereas the forward strand of the reference genome has TAAAA repeats (Supplemental Figure 2). Reads 1 and 6, from the reverse genome strand, align to the repeat at only one side of the expansion. The expanded regions of these two reads start with TTTTA repeats, which match the reverse strand of the reference, but mostly consist of TTTTC repeats. Since the expanded region of read 2 does not match the reverse complement of read 1 or 6, we infer that systematic sequencing error has occurred on at least one strand. Short-period tandem repeats are prone to a nasty kind of sequencing error: if a systematic error occurs for one repeat unit, the same error will tend to occur for all the other units, producing a different repeat (which may align elsewhere in the genome: the main reason for step 2 in tandem-genotypes).
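The strand argument above can be checked mechanically. The following Python sketch compares repeat units up to rotation and reverse complement, using the units named in this section. It is a small illustration under the assumption that units contain only A, C, G and T. The helper names are hypothetical and are not part of tandem-genotypes.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_unit(unit):
    """Alphabetically smallest rotation of the unit or of its reverse complement.

    Two units with the same canonical form describe the same tandem repeat,
    regardless of strand and of where the unit is taken to start.
    """
    candidates = []
    for s in (unit, reverse_complement(unit)):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def same_repeat(unit_a, unit_b):
    """True if the two units are equivalent up to rotation and strand."""
    return canonical_unit(unit_a) == canonical_unit(unit_b)


# Reverse-strand TTTTA matches the forward-strand reference TAAAA ...
ok = same_repeat("TTTTA", "TAAAA")
# ... but forward-strand TCCCC is not the reverse complement of TTTTC.
mismatch = same_repeat("TCCCC", "TTTTC")
```

Here `ok` is true and `mismatch` is false, which is the inconsistency between read 2 and reads 1 and 6 described above.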
Another kind of difficulty is illustrated by our chimeric human/plasmid reads for ATXN7 (Figure 6c). Here, the reference sequence adjacent to the annotated repeat is similar to the sequence within the repeat. Depending on the exact sequences and alignment parameters, the expanded region of a read may align outside the repeat annotation (Figure 6c top), or appear as alignment gaps some distance beyond the repeat (Figure 6c bottom). tandem-genotypes handles such cases, up to a point, by examining the alignments out to ad hoc distances beyond the annotated repeat.

Specificity of repeat-expansion predictions
tandem-genotypes can handle custom-made repeat annotation files in BED-like format. We made an annotation file with 31 repeat expansion disease loci, including BAFME, and analyzed these 31 repeats with our BAFME data. No large pathological expansions other than BAFME were predicted (Supplemental Figure 3). We also analyzed these 31 repeats with each of the nanopore and PacBio datasets for NA12878: no obvious pathological expansions were predicted and peaks are around zero in most cases (Supplemental Figure 4, 5). These results suggest that our method does not spuriously predict pathological repeat expansions, although there may be some difficulties detecting small disease-causing expansions (e.g. +2 alanine expansion in PABPN1 causes disease) due to deviations toward copy-number increase in PacBio sequences. We believe this will be solved when sequencing quality improves.

Prioritization of copy number changes
We also tested if it is possible to prioritize the detected copy number changes. The BAFME repeat expansion was ranked 4th out of 0.7 million tandem repeat regions in rmsk.txt (Figure 7a). When prioritization was done without any control datasets, it was ranked 13th, so using controls greatly improved prioritization (Figure 7c).
Repeat expansions in coding regions can cause disease with less than 100 extra copies, due to the high impact on proteins even from relatively small expansions. So these expansions may be difficult to prioritize. To test this, we combined tandem-genotypes output for the whole genome (NA12878 rel3) with outputs for the plasmid-rel3 chimeric reads with coding-region expansions (ATN1, HTT, ATXN2, ATXN3, CACNA1A, ATXN7 and AR). All 7 chimeric expansions were ranked in the top 10 out of 0.7 million repeat regions (Figure 7b). Again, controls greatly improved prioritization (Figure 7c).

PromethION and PacBio RSII sequencing data
We next examined the genome-wide repeat copy number changes in the NA12878 human genome sequenced by both PacBio RSII (SRR3197748) and Oxford Nanopore Technology's MinION (rel3). There was marked discordance between MinION and PacBio when the repeat unit size was one or two (Figure 8a, Supplemental Figure 6). This is probably because MinION tends to make small deletion errors and PacBio small insertion errors, which are hard to distinguish from copy number changes of these tiny repeat units. Note that a repeat unit size of one means homopolymers (such as AAAAAAAAAA). The triplet-repeat distribution showed a slight difference between MinION and PacBio (Figure 8a). However, where the repeat unit is longer than 3, MinION and PacBio had similar distributions of copy number changes (Supplemental Figure 6), so both sequencing platforms work on these tandem-repeats. We also verified that the GGGGCC strand bias, which we observed in the plasmids, is also seen in the rel3 dataset. The distribution of GGGGCC copy number change showed a slight difference between forward and reverse strands (Supplemental The MinION data we tested (rel3) was published in 2017. Nanopore base-callers and chemistries have been greatly improved recently. We tested a recent nanopore MinION dataset analyzed by MinKNOW 1.11.5, which uses the Albacore 2.0 base-caller. The strand bias for AC, AG, CCT and CTT was greatly improved (Supplemental Figure 10). We also tested a recently published human genome dataset sequenced by ONT's new high throughput sequencer PromethION. We also found that strand biases for AC, AG, CTT and CCT are greatly improved in PromethION reads (Figure 8b,d, Supplemental Figures 11, 12).

Computation time and repeat-masking
The slowest computational step was aligning the reads to the genome (lastal). For some datasets, we made it much faster by "masking" repeats (both interspersed and tandem) with WindowMasker. Here, "masking" means that the repeats (indicated by lowercase letters) are excluded from the similarity-search steps of the alignment algorithm, but are included when making the final alignments: the hope is to find the same alignments faster.
In practice, this masking is often harmless, but sometimes harmful. It did not prevent us from detecting the BAFME expansion, or expansions at 10 other disease loci in chimeric human/plasmid reads. On the other hand, it prevented detection of the SCA10 expansions (result not shown). This is because one flank of the SCA10 tandem repeat consists of transposable elements and is almost completely masked. Note this dataset has somewhat short reads (Table 2): the problem would be solved by longer reads that extend beyond the masked region.
When we do not mask, the total run time for our analysis is competitive with those of RepeatHMM and PacmonSTR (Supplemental Table 3). (Note the last-train run time does not increase much for larger datasets, because it uses a fixed-size sample of the data.) When we do mask, the computation is much faster (Supplemental Table 4), and usually the results do not change significantly (Supplemental Figure 13).

Discussion
We have presented several lines of evidence that we can robustly detect pathological expansions of tandem repeats. We successfully detected them in: constructed plasmids, semi-artificial plasmid/human sequences, and real human sequences from patients with SCA10 and BAFME. We also did not detect unexpected (false-positive) large known-pathological expansions in three whole-genome datasets: PacBio reads from a BAFME patient, and PacBio and nanopore reads from NA12878. Importantly, we can also rank copy-number changes by priority, such that pathological expansions are ranked near the top out of ~0.7 million tandem repeats in the genome.
Our method is not specific to tandem expansions, but detects any kind of expansion of a tandem repeat. For example, we detected an expansion due to insertion of an Alu SINE within a tandem repeat (Figure 5a). We also detected an expansion that contained non-repeat sequence, which turned out to be a deletion in the reference genome (Figure 5b). Such non-tandem expansions may impact genomic function and health, so we believe it is useful to detect them too during first-round genome-wide screening.
If a repeat expansion is actually the ancestral state, with the reference genome having a contraction or deletion (e.g. Figure 5b), then it is plausible that the expansion is less likely to be pathological. Thus, our prioritization of copy-number changes likely benefits from comparing the changes to ape genomes. An ancestral reference human genome would be ideal 14 . A similar idea is to de-prioritize expansions commonly present in healthy humans (Figure 7a, b): this will become more powerful as tandem repeat data accumulates.
We have also pointed out some interesting difficulties with analyzing tandem repeat sequences. Some DNA reads do not align to the repetitive region of the reference genome (e.g. Figure 6b), and systematic sequencing errors may turn a tandem repeat into a different tandem repeat. The analysis becomes harder when the reference sequence next to an annotated tandem repeat resembles the sequence in the repeat (e.g. Figure 6c). Some (inexact) tandem repeats do not have unambiguous boundaries, and different annotations (e.g. RepeatMasker versus Tandem Repeats Finder 25 ) sometimes disagree on the boundaries. In some cases, there may be no unambiguous distinction between expansion of a tandem repeat and sequence insertion near the repeat.
Systematic sequencing errors can have different effects on the two strands of tandemly-repeated DNA, causing the predicted copy number changes to have a bimodal distribution (e.g. Figure 1i). So it is important to indicate which predictions come from which strands, in order to not misinterpret this as two alleles. We report length and strand biases of several long-read sequencers for every possible type of di- and tri-nucleotide repeat: these biases are prominent for specific repeats (e.g. CTT and CCT in older MinION data), and the worst biases are greatly improved in more recent sequencing systems.
If sequencing accuracy continues to improve, tandem repeat analysis will obviously benefit. The alignment will automatically become faster, due to lower tolerance of gaps and substitutions. Copy number will be predictable more accurately and with lower coverage.
This study demonstrates a practical and robust way to identify changes in tandem repeats that may have biologically impactful consequences. Although there are still limitations due to the developing sequencing technologies and cost to immediately apply this approach in clinical sequencing, we clearly show that there is hope that long read sequencing is useful to identify overlooked changes in the genome, and may give an answer to the large numbers of patients with genetic diseases whose causes and mechanisms have remained unsolved for many years.

Competing interests
None of the authors have competing interests.

Figure legends
Distribution of predicted change in repeat copy number, for nanopore reads from each of ten plasmids. Red arrows: projected copy number change. Forward (red) and reverse strand reads (blue) are shown separately. y-axis: read count, x-axis: change in copy number relative to the reference plasmid. Reference copy numbers in each plasmid are in Supplemental Table 1. Black arrows: reads in these peaks may actually have shortened repeats.

Distribution of predicted change in repeat copy number, for nanopore reads of human DNA with inserted repeats. Reads covering each of ten disease-associated repeat loci were selected, and the repeat region in each read was replaced by: the repeat region of a plasmid nanopore read (a-j top panels), or exact repeats (a-j bottom panels). y-axis: read count, x-axis: change in copy number relative to the reference human genome. Forward (red) and reverse strand reads (blue) are shown separately.

Alignments of DNA reads (vertical) to the reference human genome (horizontal). Diagonal lines indicate alignments, of same strands (red) and opposite strands (blue). The vertical stripes indicate repeat annotations in the reference genome: tandem repeats (purple) and transposable elements (pink). (a) Six reads from a BAFME patient that cover the disease-causing SAMD12 AAAAT repeat locus.