Genome sequencing data
A summary of the next generation sequencing (NGS) data produced in this study is shown in Additional file 1. We previously sequenced the SHR genome with median coverage of 10× [10]. By expanding the NGS data set of SHR the median coverage of this genome was increased to 23× (Figure 1a). For the analyses of the BN-Lx genome we produced three independent fragment libraries and a long mate-pair library. Combined this resulted in 33× median base coverage of this genome (Figure 1a). The BN/NHsdMcwi strain has been used to create the reference genome of the rat [11]. We refer to this strain as the reference BN. This strain was sequenced by traditional capillary sequencing and has not been covered to the same depth as we generated here for BN-Lx and SHR. Lack of coverage in the reference may contribute to false discovery of genomic variants in BN-Lx and SHR. To prevent false discoveries, we used novel whole genome NGS data that were generated from material from the same animal (named Eve) that was used to create the BN reference sequence. We used these data from the BN reference with 32× NGS coverage to filter for likely reference genome errors.
Small genomic variants
We used two variant calling algorithms for SNV detection, which were employed on the SHR, BN-Lx and BN reference data. For each SNV position found in either BN-Lx or SHR we checked that the SNV was not found in the reference BN sample. We took the SNVs found by both callers to come to the final SNV set reported here (Additional file 2 contains a list of all SNVs in SHR; Additional file 3 contains a list of all SNVs identified in BN-Lx). The results of both SNV callers can be viewed in separate tracks in the Rat Genome Database [12].
We detected 15,627 homozygous SNVs unique to BN-Lx and 3,162,031 unique to SHR (Figure 1b) as compared, respectively, to the BN reference, adding up to a total of 3,177,658 SNVs within the RI panel. Capillary sequencing confirmed 70 of the 70 randomly selected homozygous SNVs (100%; Additional file 4). Although inbred strains are expected to be homozygous at each position, we also found 17,544 heterozygous positions specific for SHR and 225 specific for BN-Lx. Possible sources of heterozygous SNVs include recent de novo mutations, sequencing noise, incomplete inbreeding and diverged duplications. The latter two mechanisms are expected to result in clustering of heterozygous positions, whereas noise and de novo mutations would appear randomly. We analyzed the genome of SHR in 100 kb windows and found that the distribution of heterozygous SNVs is far from random; 84% of the heterozygous SNVs locate to windows with more than three SNVS. Of simulated randomly positioned SNVs, only 2.6% are located in such SNV-dense windows. Furthermore, 91 of the 148 genomic regions with the highest density of heterozygous SNVs (top 1%) were found to overlap with duplications, indicating that diverged duplications are a major source of heterozygous SNVs.
The homozygous SNVs between BN-Lx and the BN reference genome are also not distributed randomly across the genome. There are eight regions with a high SNV density, with a combined length of 51 Mb, together holding 97% of the SNVs (Figure S1A in Additional file 5). One of these regions, on chromosome 8, was expected to have a high density of SNVs because BN-Lx is known to be congenic for this region, which contains the Lx locus [4]. The discovery of the other seven regions demonstrates that there are more sub-chromosomal differences between BN-Lx and the reference than was expected based on the breeding history. These differential regions could be remnants of the original cross with the PD strain that created the congenic strain that was the founder for BN-Lx. To investigate this possibility, we analyzed a subset of the SNVs in other rat strains. Saar et al. [13] previously characterized 127 SNV positions in the 7 regions that differ between BN-Lx and the reference genome by high-throughput genotyping in 167 strains and substrains. Nine BN substrains from different institutes were included in this study. We found that a subset of the BN substrains has the same genotype in the seven polymorphic regions as BN-Lx (Figure S1B in Additional file 5). This shows that variants likely reflect the background of the BN strain that was used as a founder for generating the BN-Lx strain rather than remnants of the PD strain.
We did not find genomic regions with increased SNV density in SHR. We detected 220,332 additional SNVs that were not found in the analyses based on the 10× coverage of the SHR genome [10]. We also found 76,019 positions that were previously called homozygous SNV to be also non-reference in BN (6,077 positions), BN-Lx (28,751 positions) or both (41,191 positions), indicating that these most likely reflect errors in the reference genome assembly.
Next, we used the overlap between the two independent analysis tools to reliably call indels in the two rat strains. We identified 1,615 indels specific for BN-Lx and 424,309 specific for SHR (Figure 1b; Additional files 6 and 7 for SHR and BN-Lx, respectively). We analyzed 57 indels with capillary sequencing and confirmed all 57 (100%; Additional file 8). We called indels ranging from one to ten base pairs; 36% of the indels are larger than 1 bp (Figure 1c) and a large proportion (51%) of the indels are changes in the length of homopolymer stretches longer than 3 bp (82% of these involve A or T stretches).
Fifty percent of the SHR indels were not found in the previous analysis, and 27,416 of the 343,243 indels detected in the previous analyses were found to be not specific to SHR, of which 19,138 were found in both BN-Lx and the reference strain. These differences are not only the result of a higher base coverage, but are also due to a change in analysis methods. The previous report was based on a combination of MAQ and BLAT alignments [10] and here we used two independent indel calling algorithms. Moreover, the previous set was not corrected for reference errors.
As was seen for the SNVs, the indels that are seen in BN-Lx compared to the BN reference are found clustered in the genome; 93% of the BN-Lx indels are found in the eight regions that are different between BN-Lx and the reference.
Structural variation
We detected SVs by two independent methods: based on raw read coverage and by long mate-pair (LMP) analyses.
NGS coverage differences between either SHR or BN-Lx and the reference BN can be used to detect changes in copy number. Using a dynamic window binning approach to identify genomic segments with differential NGS read coverage (DWAC-Seq; VG, manuscript in preparation), we identified four duplications in BN-Lx and four deletions compared to the BN reference, including a known 3 kb deletion on chromosome 8 [14]. In SHR we identified 384 duplications and 461 deletions (Figure 1b). The duplications ranged from 1,028 bp to 270 kb covering 7.3 Mb in total, and the deletions ranged from 1,036 bp to 2.3 Mb covering 11.7 Mb in total.
Some SVs are not detectable based on sequence coverage due to the repetitive nature of the sequence content or because they are copy neutral. Therefore, we analyzed LMP data for SHR and BN-Lx. In brief, LMP analysis involves DNA fragmentation, selection of fragments of a specific size range and sequencing of the ends of these selected fragments. The sequenced ends are aligned to the reference sequence and are expected to align at a distance from each other that matches the selected fragment size range. Mate-pairs that map at a different distance or in an unexpected orientation pinpoint changes in genome structure, which can be deletions, tandem duplications, insertions, inversions or translocations [15]. The SHR LMP library had a median insert size of 2 kb [10] and as a control we produced a LMP library of the reference BN with the same insert size. We found 76 tandem duplications, 97 inversions and 900 deletions specific for SHR. The mate-pair library of BN-Lx has a median insert size of 5.5 kb. In BN-Lx we identified 4 deletions and the large insert size of the BN-Lx mate-pair library allowed the identification of 17 insertions (Figure 1b).
The analyses on NGS coverage and LMP libraries combined results in a comprehensive list of SVs in SHR and BN-Lx versus the reference BN (Additional file 9).
Transcriptome variations in liver
To investigate the effects of the genomic variation on transcriptome characteristics, we performed RNA sequencing on liver samples of SHR and BN-Lx1. We sequenced the mRNA of three animals of each strain to control for inter-animal diversity. By polyA and 5'-cap selection we obtained the full length mRNA products from the pool of total RNA. Strand information was preserved in the library preparation procedure. We obtained over 7.4 million mapped reads for each animal and 78 to 80% of the reads mapped to coding sequences (Figure 2a).
Using normalized read counts per gene as a measure for gene expression and taking intra-strain variation into account, we identified 532 differentially expressed genes with a minimum of a two-fold difference between BN-Lx and SHR (false discovery rate (FDR) <5%). We randomly selected 12 differentially expressed genes that differed between 2 and 21 fold and analyzed them by quantitative PCR (qPCR). Expression of all 12 genes was confirmed to be significantly different between SHR and BN-Lx in the qPCR experiments (P-values 9.8 × 10E-09 to 0.04, one sided two sample t-test, assuming equal variance) and the fold changes showed a median 1.6-fold difference between the two methods (Figure 2b). The qPCR experiments thus confirm our RNA-seq analyses.
We also investigated the differences in splicing events between the two rat strains (Figure 2c). For this purpose we created a reference transcriptome consisting of all combinations of annotated exons. We mapped the RNA-seq data to this transcriptome reference and quantified sequencing reads that mapped across exon-exon junctions. Because the goal was to relate splicing differences to genetic variation, rather than to identify all splicing variants in the transcriptome, we selected the exon-exon junctions with the largest differences between the strains. We found 38 junctions with large differences in coverage between SHR and BN-Lx in 36 genes; 27 of the junctions are annotated and 11 are new combinations within a gene (Additional file 10).
Thus, there are both quantitative and qualitative differences between the transcriptomes of BN-Lx and SHR in liver tissue.
Genomic variation in coding parts of differentially expressed genes
Next we combined genomic and RNA-seq data to gain insight into potential direct effects of different types of genetic variants on the transcriptional output.
Combining the genomic sequence data from SHR and BN-Lx, we found 25,271 SNVs in coding positions, 8,830 non-synonymous coding, 129 in essential splice sites, 99 SNVs that cause a stop codon gain and 8 SNVs that cause a stop codon loss (Additional file 11 lists the affected genes and if they were expressed and differentially expressed in liver tissue).
There are 402 indels located in coding sequence, including 26 in essential splice sites and 260 indels that cause a frameshift (Additional file 12 lists the affected genes and if they were expressed and differentially expressed in liver tissue). Another 1,394 and 119 indels are located in 3' and 5' UTRs, respectively. Together, 1% of the SNVs and indels are located in coding regions or splice sites, and 489 genes are affected by essential splice site, stop or frameshift variants.
Of the SVs (which include CNVs and structural variants called in the LMP analyses), 6% contain at least parts of coding sequences, together affecting 108 genes (Additional file 13 lists the affected genes and if they were expressed and differentially expressed in liver tissue).
We investigated which types of genomic variants are most likely to alter gene expression - in other words, which have the highest predictive values. We focused on the expressed genes for each variant and determined the fraction of genes that showed differential expression levels between BN-Lx and SHR (Figure 3). Deletions and duplications are often called by both CNV and LMP analyses. We therefore combined these sets to one list of genes affected by duplications and one list of genes affected by deletions (relative contribution of both analyses methods is given in Additional file 13).
We found that full gene duplications have the highest predictive value (Figure 3). The fraction of genes that is differentially expressed is relatively low in the gene set carrying SNVs. To gain insight into these differences, we studied these effects of the different types of variants in more detail.
Small genomic variants versus differential expression levels
Novel stop codons can lead to degradation of the mRNA through nonsense-mediated decay. Stop codons can be introduced by SNVs and by frameshifts, which can result from indels or splice site changes; 193 genes carrying one of these types of variants are expressed in liver, and only 19 of these are differentially expressed between SHR and BN-Lx (Figure 3).
To verify that the lack of effect of stop mutations on transcript levels is not due to false positive genomic variants, we analyzed a subset with capillary sequencing. We sequenced 17 SNVs that were predicted to create a stop codon and were located in a gene that is expressed in liver. All 17 positions were true SNVs (Additional file 14). This demonstrates that the lack of change in expression is not due to false variant predictions.
An alternative hypothesis is that the functional predictions are invalid. Visual inspection of the RNA-seq data revealed that many of the stop variants in the unaffected genes are located in exons that are not included in the transcript in liver. Functional predictions can be improved when data on alternative transcript usage become more widely available. In addition, we found many examples where inaccurate annotation results in erroneous reading frame depictions and, as a result, in false stop codon predictions. Thus, the limited effect of predicted stop mutations on transcript levels is not due to false variant detection, but rather the result of functional predictions that are hampered by incomplete and inaccurate gene annotation.
If functional predictions are improved, the total the amount of SNVs and indels that are predicted to change stop codons will likely decrease. The fraction of the stop variants that has an effect on transcript level would then be expected to become higher. However, what can be estimated from the data presented here is that the total contribution of stop variants to transcriptome variation is low. Only 4% of the differentially expressed genes contain a stop variant. These analyses do not capture the full effects of novel stop codons. We have only investigated liver here and it is possible that the effects are larger in other tissues. Moreover, stop-codons could also result in truncated proteins, without affecting expression levels of the gene.
SVs in coding sequences of differentially expressed genes
Intuitively, a full gene deletion is the type of SV most likely to change gene expression. However, gene deletions cannot contribute to variation in the transcriptome in liver tissue if the deleted genes are not active. Of the 13 genes that are predicted to be completely deleted, only one, Uba2, is expressed in liver and this gene does not show differential expression levels between BN-Lx and SHR. This specific deletion in SHR was only detected in the mate-pair analysis and was not found based on genomic coverage. This suggests the gene sequence is in the genome of SHR, but not at the same position as in the reference genome. It is possible that this gene translocated to a different part of the genome, but is still functional. Unfortunately, there were no mate-pairs with one end inside the deletion, so we could not determine the translocation site based on the LMP data. Alternatively, it is possible that the gene is not translocated and is really deleted, but that the deletion is missed in the coverage analyses.
Gene duplications are strongly correlated with changes in expression. Five of the seven fully duplicated genes that are expressed in liver are differentially expressed. Three of the affected genes, RT1-T24-1, RT1-CE1 and RT1-CE4, reside in a highly polymorphic region on chromosome 20 coding for the MHC1 complex. This part of the genome shows a complex pattern of variation at the genomic and transcriptomic levels. We investigated the data for the other two differentially expressed genes found to be duplicated; Mx2 and Layn, in more detail (Figure 4 shows the analyses for Layn; Figure S2 in Additional file 5 shows the analyses for Mx2). Visual inspection of the coverage in the regions showed that both duplications cover only the differentially expressed gene and no other genes (Figure 4a; Figure S2A in Additional file 5). Both Mx2 and Layn show an increase in expression (Figure 4b; Figure S2B in Additional file 5). The elevated transcript levels could result from the fact that there are four copies of the genes per cell instead of two. Alternatively, the duplication of regulatory elements could result in up-regulation of the original two gene copies. To investigate these hypotheses, we analyzed the heterozygous SNVs called in the genome in the RNA-seq data. Both duplications contain heterozygous SNVs. These can arise from diversification of the sequence after duplication. Both the original and the duplicated sequence map to the same position in the reference genome and are detected as heterozygous SNVs. We focused on the heterozygous SNVs in the exons of the genes. All nine heterozygous genomic positions that are located in the Mx2 exons were also heterozygous in the RNA-seq data (Figure S2C in Additional file 5). In Layn 6/8 heterozygous positions in the exons showed expression from both alleles (Figure 4c). In BN-Lx, which does not carry the duplications, all described positions were homozygous reference at the DNA and RNA levels. These data uniquely show that the transcripts of Mx2 and Layn are produced from two different alleles in SHR; the original locus and the extra copy.
Some duplications and deletions cover only parts of genes. There are 18 expressed genes with partial duplications, of which 5 are differentially expressed. Nineteen genes with predicted partial deletions are expressed in liver and five of these are differentially expressed. One of the genes that is partially deleted is located in the CD36 locus, which has a known functional deletion [16]. The deletion in RT1-Bp causes the loss of the last exon, described below. The other three partial deletions found in differentially expressed genes were on visual fine mapping and analyses of RNA-seq data all confined to intronic sequences. Characterization of the deleted sequences showed that they all contained >2,100 bp of a repetitive element: two were remnants of long interspersed elements (LINEs) and one was a long terminal repeat (LTR) element (Figure 5). This led us to examine other deletions in introns. Of the 100 deletions that contain >2 kb of repetitive sequence, 21 are located in introns. Nine of these genes are expressed in liver and four are differentially expressed. These examples suggest a possible effect on expression levels mediated by transposon-derived sequences in introns.
The above described findings on duplications and deletions demonstrate the potential of the HXB/BXH model system for the elucidation of SV-mediated changes in gene expression.
Non-coding genomic variation and expression levels
An important part of the sequences that regulate gene expression is located outside the coding parts of genes. Therefore, we investigated the presence of genetic variation in the 5' and 3' UTRs and in the 5 kb region upstream of the transcriptional start sites. Of the expressed genes in liver tissue, 2979 have at least one SNV or indel in the 5' and or 3' UTR. Only 5% of these genes are differentially expressed. Of the total set of transcribed genes, 4% is differentially expressed. Thus, the presence of an SNV at a conserved position in a UTR does not make a gene much more likely to be differentially expressed. So what about the number of SNVs and indels in the regulatory regions? We investigated SNV and indel density in the differentially versus the non-differentially expressed gene set. We found that the differentially expressed genes contain, on average, 8.1 SNVs and 2.0 indels in the 5 kb upstream of the transcriptional start site. The number of small genomic variants is smaller in upstream regions of genes that are not differentially expressed; 6.6 SNVs and 1.7 indels on average. Although the absolute differences are small, they are highly significant (two tailed t-test, P-value 0.0003 and 0.005 for SNVs and indels, respectively). This is in accordance with the previous finding that cis-eQTL gene loci are enriched for SNVs and indels [10].
Differential splicing and splice site mutations
Next we analyzed the correlation between small genomic variants and splicing. There are 129 SNVs and 26 indels located in essential splice sites. The quantitative analyses of exon-exon junctions resulted in 36 genes with differential splicing. Only one of these contained a genomic variant, a SNV, directly in an essential splice site (Figure 6a). This sequence change resulted in the skipping of an exon in SHR (Figure 6b). The affected gene, Slc22A18, is imprinted in mouse and human and has been linked with tumorigenesis [17, 18]. The exon loss does not lead to a frame-shift but results in a partial loss of a transmembrane domain, which could be deleterious to the function of the gene.
The lack of overlap between genomic variants and splicing variation is not simply because the genes with genomic variants in essential splice sites are not expressed in the investigated tissue. Sixty-nine of the genes with SNVs and 14 genes with indels in essential splice sites are expressed, but only 8 junctions have more than 2 reads mapped to them and none have more than 7 mapped reads. This suggests that, at least in liver, many transcripts are not spliced according to the current annotation.
Alternative transcripts and genomic variants outside splice sites
The finding that the genes that are differentially spliced in BN-Lx and SHR do not contain genomic variants in essential splice sites suggests that the alternative splicing is caused by genomic variation at regulatory positions. Splicing can be regulated by splicing factors that bind to regulatory sites located in introns [19]. We therefore investigated the genomic variants in the introns of the affected sites. We selected the 27 splice junctions formed between exons that neighbor each other in the gene and, thus, cover only one intron. Of these 27 selected junctions, 17 had no genomic variant at all in the intron. To detect possible functional sites in the ten introns with genomic variants, we analyzed the phastCons9way conservation scores per variant base. This score represents the probability that a specific base is conserved, with 1.0 indicating most likely conserved [20]. All but one variant position had conservation scores below 0.1. The highest conservation was 0.26, but this SNV was not causative. In the intron with this SNV there is another SNV 10 bp before the acceptor site, which creates a new splice site and adds three amino acids to the protein, which was confirmed by sequencing the cDNA (Figure S3 in Additional data 5). The affected gene is the highly conserved Fibrinogen alpha.
Despite this one example, the vast majority of quantitative differences in splicing can not be linked with a genomic variant in the corresponding intron.
While investigating differentially expressed genes we found another source of alternative transcripts. Some of the genes that overlap SVs not only change the expression level of the annotated gene sequence, but also show novel alternative transcripts. For example, the uncharacterized gene D3ZCV5_RAT, which is duplicated in SHR, produces a different transcript than the locus in BN-Lx (Figure 6c). This gene shows differential expression, but was not included in the total analyses presented in Figure 3a because it is a novel gene without a Refseq ID. The RNA-seq data for this gene do not show that different exons are expressed in the duplicated locus in SHR (Figure 6c).
The highly polymorphic region of the MHC locus on chromosome 20 shows a mixture of different SVs, differences in expression levels and production of alternative transcripts (data not shown). There is one clear example of a partial deletion that resulted in the loss of an exon (Figure 6d), but more detailed functional analyses are necessary to elucidate the remainder of these exceptionally complex sites.