Sequencing and characterization of the FVB/NJ mouse genome
© Wong et al.; licensee BioMed Central Ltd. 2012
Received: 30 May 2012
Accepted: 23 August 2012
Published: 23 August 2012
The FVB/NJ mouse strain has its origins in a colony of outbred Swiss mice established in 1935 at the National Institutes of Health. Mice derived from this source were selectively bred for sensitivity to histamine diphosphate and the B strain of Friend leukemia virus. This led to the establishment of the FVB/N inbred strain, which was subsequently imported to the Jackson Laboratory and designated FVB/NJ. The FVB/NJ mouse has several distinct characteristics, such as large pronuclear morphology, vigorous reproductive performance, and consistently large litters that make it highly desirable for transgenic strain production and general purpose use.
Using next-generation sequencing technology, we have sequenced the genome of FVB/NJ to approximately 50-fold coverage, and have generated a comprehensive catalog of single nucleotide polymorphisms, small insertion/deletion polymorphisms, and structural variants, relative to the reference C57BL/6J genome. We have examined a previously identified quantitative trait locus for atherosclerosis susceptibility on chromosome 10 and identified several previously unknown candidate causal variants.
The sequencing of the FVB/NJ genome and generation of this catalog has increased the number of known variant sites in FVB/NJ by a factor of four, and will help accelerate the identification of the precise molecular variants that are responsible for phenotypes observed in this widely used strain.
The origins of the FVB/NJ line can be traced back to an outbred colony of Swiss mice (N:GP, NIH general purpose) established at the National Institutes of Health (NIH) in the 1930s. A second colony (N:NIH) was subsequently established from the N:GP strain in the 1940s, and in the 1960s two new sub-lines (HSFS/N and HSFR/N) were established by selecting for sensitivity to histamine diphosphate challenge following pertussis vaccination. After eight generations of inbreeding, histamine-sensitive mice (HSFS/N) were found to carry the Fv-1b allele for sensitivity to the B strain of Friend leukemia virus [1, 2]. These mice were bred to homozygosity and designated as the FVB/N inbred strain. FVB/N mice were imported to the Jackson Laboratory in 1988 at F37 and in 1991 were re-derived at N50 for addition to the Jackson foundation stocks. The FVB/NJ mouse has since become the workhorse of many mouse genetics laboratories due to its high reproductive performance, large litter sizes, and prominent pronuclei in fertilized eggs, which make it an ideal strain for transgenic strain production [1, 3]. FVB/NJ also represents an important disease model, and this strain has been used in a range of congenic and inter-strain crosses to identify disease-causing genes.
One well-known phenotype of FVB/NJ mice is early onset retinal degeneration, which has been attributed to a nonsense mutation in the Pde6b gene. FVB/NJ mice also fail to secrete complement 5 due to a 2-bp deletion in the Hc gene, which causes a truncation of the protein [1, 5]. Resistance to collagen-induced arthritis, observed in FVB/NJ, has been shown to be due to a single nucleotide polymorphism, a 3-bp indel, and a large deletion in the T-cell receptor variable regions. FVB/NJ mice also display a number of phenotypes relating to human disease, including susceptibility to seizures, diseases of the central nervous system, mammary hyperplasia and, with age, the spontaneous development of tumors in a range of organs, most commonly in the lung. In addition, FVB/NJ mice are highly susceptible to chemically induced skin tumors, and are resistant to atherosclerosis.
Recent advances in sequencing technologies have reduced the cost of genome resequencing to a fraction of the cost required to produce the C57BL/6J mouse reference genome [12, 13]. Next-generation sequencing technologies have been used to sequence the genomes of 17 common laboratory mouse strains, providing the most detailed picture of molecular variation between mouse strains to date [14, 15]. This resource has already been used to accelerate the process of identifying candidate functional variants. To complement this resource we sequenced the FVB/NJ genome. Here we describe a high quality catalogue of SNPs, insertion/deletion polymorphisms (indels) and structural variants (SVs) for this important strain. We compare these variants to those discovered in the 17 mouse genomes to profile the level of private variation in this strain, and demonstrate how our catalogue of FVB/NJ variants can be applied to identify and prioritize putative causative variants at quantitative trait loci (QTL) by examining the chromosome 10 atherosclerosis susceptibility locus, Ath11.
The FVB/NJ mouse genome was sequenced to a depth of approximately 50-fold coverage using 100-bp paired-end reads generated by the Illumina HiSeq 2000 sequencing platform (European Nucleotide Archive accession ERP000687). The sequencing reads were mapped to the C57BL/6J mouse reference genome (NCBIM37/mm9) with BWA, followed by local realignment of reads around indels discovered by the Mouse Genome Project (MGP) using the Genome Analysis Toolkit (GATK). SNP and indel discovery was performed using the SAMtools mpileup function and BCFtools (dbSNP handle: SC_MOUSE_GENOMES).
We have also catalogued 0.82 million small indels in the FVB/NJ genome. The 0.38 million insertions range in size from 1 bp to 34 bp and the 0.44 million small deletions range in size from 1 bp to 56 bp. We also used the iPLEX Gold Assay to estimate the false positive rate of our indel calls. We tested a random sample of 100 short indels and estimate the false positive rate to be 10.1% (Materials and methods).
Predicted functional consequence of SNPs and indels
Consequence categories: 5 kb upstream or downstream; 5' or 3' UTR; essential splice site; in-frame codon insertion or deletion; two or more consequences; within non-coding gene or mature microRNA; 5 kb upstream or downstream of non-coding gene; intergenic (>5 kb from a coding or non-coding gene)
Enrichment of PANTHER ontology terms from a list of 394 genes with radical amino acid substitutions (Grantham matrix score >150)
Enriched terms (reported as percentage of total): MF00002:G-protein coupled receptor; MF00224:KRAB box transcription factor
Similar to the SNPs, the majority of FVB/NJ indels are intergenic (50.4%) or within 5 kb upstream or downstream of a gene (14.3%) (Table 1). Only a small number of indels cause frameshifts (126) or in-frame codon losses or gains (140), including the known 'TA' deletion in the hemolytic complement gene (Hc; starting at position 34,898,728, chromosome 2), which is associated with susceptibility to allergen-induced bronchial hyper-responsiveness. Manual inspection of the three indels predicted to affect splice sites revealed that two of the affected genes are likely to be a pseudogene and a long intergenic non-coding RNA, and the third is built from an incorrect gene model (Table s3 in Additional file 1).
We computationally determined the genotypes of the 17 MGP strains at all FVB/NJ variable SNP sites (Materials and methods) to identify SNPs and indels unique to FVB/NJ ('private' SNPs and indels). To obtain a high-confidence list, we classified a SNP site as private to FVB/NJ only if all 17 MGP strains were called as high quality homozygous reference alleles, as described for the FVB/NJ data in the Materials and methods. Using these criteria, we identified 115,228 private SNPs in FVB/NJ, which is 2.7% of the total number of SNPs called in this strain (Figure 1). From Sequenom genotyping of 103 sites, we estimate the false positive rate for private SNPs to be 3.0 to 5.8% (Materials and methods).
Non-synonymous single nucleotide variants unique to FVB/NJ, relative to the 17 Mouse Genomes Project strains, with Grantham matrix score >150
Affected genes: Sodium channel, voltage-gated, type X, alpha; Ring finger protein 186; Olfactory receptor 1469; Forkhead box A3; Amyloid beta (A4) precursor protein-binding, family B, member 1; Rho guanine nucleotide exchange factor (GEF) 5; Ubiquitin-conjugating enzyme E2L 6; Carnitine palmitoyltransferase 1a, liver; Ribosomal protein 37, pseudogene 1; Erythroblast membrane-associated protein; Olfactory receptor 1454; Deltex 4 homolog (Drosophila); Zinc finger protein 521; Olfactory receptor 922; Dual specificity phosphatase 27 (putative)
We also identified 8,172 private indels using the criteria described above. Similar to the private SNPs, the majority of private indels are intergenic (45.6%), intronic (24.0%), within 5 kb of a coding gene (11.4%), or have multiple consequences (18.0%). Only seven of these affect all transcripts of a protein-coding gene: two are in-frame codon insertions or deletions and five result in frameshifts. All of the affected genes are olfactory receptors, with the exception of an in-frame 3-bp deletion in Hr, the hairless gene.
Structural variants in the FVB/NJ genome
Variant types listed: deletion in gain; deletion + insertion; inversion + deletion or insertion
When compared to the 17 strains sequenced in the MGP, 8,060 SVs in FVB/NJ were not identified in any other strain, with the majority (6,196, 76.8%) of these being insertions. This list is likely to be composed of true private FVB/NJ insertions, true insertions in FVB/NJ that were missed in one or more of the 17 MGP strains, and false insertion calls. We used two software programs to call insertions: RetroSeq and SECluster (unpublished) (Materials and methods). In order to estimate the false discovery rate of our insertion calls, we randomly selected 50 insertions from calls made by both RetroSeq and SECluster and 50 insertions each from calls made exclusively by RetroSeq and SECluster for PCR validation, giving 150 insertions in total. An insertion in the FVB/NJ genome was not observed for 6 of the 120 (5%) primer pairs that produced a band in both the reference C57BL/6J and FVB/NJ. From this we can infer that the majority of the approximately 8,000 insertions observed in FVB/NJ and not the 17 MGP strains are likely to be real. The false negative rates from the MGP showed that insertions were more difficult to identify than deletions, with false negative rates ranging from 24% to 32% for insertions, compared to 15% to 20% for deletions in 7 founder strains of the heterogeneous stock. Therefore, a large portion of the approximately 8,000 insertions present only in FVB/NJ may be false negatives in one or more of the 17 MGP strains, rather than true private insertions. Several factors contribute to our improved sensitivity for identifying large insertions in FVB/NJ, compared to the MGP: higher sequencing depth (approximately 50× coverage compared to an average 25×), longer read length (100-bp HiSeq reads compared to primarily 54- to 76-bp Genome Analyzer II reads), and improvements in read alignment algorithms (BWA versus MAQ).
Additionally, both RetroSeq and SECluster were also used in the MGP; however, both programs have been adapted to take advantage of the additional information provided by the longer reads and by the ability of BWA to align portions of reads that flank insertion breakpoints (Materials and methods).
The search for genes of interest in QTL mapping experiments involving FVB/NJ mice is greatly facilitated by having a complete catalogue of genome-wide variants. To this end, we have re-examined a QTL for atherosclerosis susceptibility on chromosome 10, Ath11. QTLs for atherosclerosis susceptibility have been identified on chromosomes 1, 10, 14, 15 and 18 using inter-crosses between C57BL/6J (atherosclerosis-susceptible) and FVB/NJ (atherosclerosis-resistant) mice on Apoe -/- and Ldlr -/- knockout backgrounds [33–35]. Further studies of Ath11 using subcongenic mice allowed Wolfrum et al. to refine the congenic interval (58.3 Mb) into two smaller regions, 10a (7.3 Mb), which is female-specific with 21 genes, and 10b (1.8 Mb), which contains 7 genes and is operative in both sexes. The authors also examined differential expression of genes in these intervals using aortic tissues from F1 offspring. Wolfrum et al. searched for candidate causative SNPs in the two regions by mining public data resources (primarily dbSNP128) for sites at which C57BL/6J and FVB/NJ differed. They identified 22 SNPs affecting coding regions, intronic splice sites, or 5' or 3' UTRs. Additionally, they identified 31 potentially polymorphic sites, that is, sites known to be polymorphic but of unknown genotype in FVB/NJ. The authors listed genes of interest in the 10a and 10b regions as those on the Cardiovascular Gene Ontology Annotation Initiative's list of cardiovascular-associated genes: Ipcef1, Oprm1, Mtrf1l, Syne1, Esr1, Mthfd1l, Pde7b, Myb, Aldh8a1, and Sgk1. The Esr1 gene, estrogen receptor α, in the 10a female-specific region was noted as a promising candidate gene, as it was identified as a regulator of one of the two gene networks identified from differentially expressed genes in the aortas of atherosclerosis-resistant and atherosclerosis-susceptible congenic mice.
However, their database search revealed only one synonymous SNP and one potential synonymous SNP between FVB/NJ and C57BL/6J in Esr1. Aldh8a1, which was differentially expressed between congenic strains, was found to have one synonymous SNP and one 3' UTR SNP, but did not have any non-synonymous SNPs. In the 10b region, only one non-synonymous SNP was found in Myb, and another potential non-synonymous SNP was found in Hbs1l, while the remaining SNPs were synonymous SNPs or affected an intronic splice site or 5' or 3' UTRs.
Non-synonymous SNPs between C57BL/6J and FVB/NJ in Ath11
Columns: chromosome 10 position; gene name, Ensembl gene ID and description
Gene and protein domain descriptions, in table order: Synaptic nuclear envelope 1; Estrogen receptor 1; Nuclear hormone receptor, ligand-binding, core; Zinc finger and BTB domain containing 2; Methylenetetrahydrofolate dehydrogenase (NAD P+ dependent) 1-like; Tetrahydrofolate dehydrogenase/cyclohydrolase, NAD(P)-binding domain; Methylenetetrahydrofolate dehydrogenase (NAD P+ dependent) 1-like; Tetrahydrofolate dehydrogenase/cyclohydrolase, catalytic domain; Low density lipoprotein receptor-related protein 11; Polycystic Kidney Disease (PKD) domain; Aldehyde dehydrogenase 8 family; Aldehyde dehydrogenase domain; RIKEN cDNA 4930444G20 gene; RIKEN cDNA 4930444G20 gene; RIKEN cDNA E030030I06 gene; RIKEN cDNA E030030I06 gene; RIKEN cDNA E030030I06 gene
We have sequenced the FVB/NJ mouse genome and catalogued SNPs, indels, and SVs. For SNPs alone, our study has increased the number of known variant sites by a factor of four for the FVB/NJ strain and therefore will serve as a valuable resource, as the FVB/NJ mouse strain is widely used for the generation of transgenic mice and in QTL mapping. We have shown how this resource can be used to characterize a QTL, accelerating the identification of candidate causal variants.
FVB/NJ DNA was obtained from the Jackson Laboratory (#1800; pedigree: 10-00964; generation: F95pF98) from a female FVB/NJ mouse. DNA (1 to 3 μg) was sheared to 100 to 1,000 bp using a Covaris E210 or LE220 (Covaris, Woburn, MA, USA) and size selected (350 to 450 bp) using magnetic beads (Ampure XP; Beckman Coulter). Sheared DNA was subjected to Illumina paired-end DNA library preparation and PCR-amplified for six cycles. Amplified libraries were sequenced using the HiSeq platform (Illumina) as paired-end 100 base reads according to the manufacturer's protocol. Each sequencing lane was genotype checked against the Perlegen SNP calls using the SAMtools programs BCFtools/glfTools. A list of libraries and sequencing statistics is available in Table s5 in Additional file 1.
Sequencing reads from each lane were aligned to the C57BL/6J reference genome (NCBI build M37/mm9) using BWA version 0.5.9-r16 and the parameters '-q 15 -t 2'. To improve SNP and indel calling, the GATK 'IndelRealigner' was used to realign reads near indels from the MGP. The BAM files were then re-sorted and quality scores were recalibrated using GATK 'TableRecalibration'. SAMtools 'calmd' was then used to recalculate MD/NM tags in the BAM files. All lanes from the same library were merged into a single BAM file using Picard tools and PCR duplicates were marked using Picard 'MarkDuplicates'. Finally, the library BAM files were merged into a single BAM containing all FVB/NJ sequencing reads.
SNPs and indels were identified using the SAMtools mpileup function, which finds putative variants and indels from alignments and assigns likelihoods, and BCFtools, which applies a prior and performs the variant calling. The following parameters were used: for SAMtools mpileup '-EDS -C50 -d 1000' and for BCFtools view '-p 0.99 -vcgN'.
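The calling step can be written down as a two-stage shell pipeline. The sketch below only assembles the command string and does not run anything. The option strings '-EDS -C50 -d 1000' and '-p 0.99 -vcgN' are taken from the text; the `-u` and `-f` options, the trailing `-` for standard input, and the file names `mm9.fa` and `fvb_nj.bam` are assumptions based on the legacy SAMtools 0.1.x interface, not details given in the paper.

```python
import shlex

# Options quoted in the text.
MPILEUP_OPTS = ["-EDS", "-C50", "-d", "1000"]
BCFTOOLS_VIEW_OPTS = ["-p", "0.99", "-vcgN"]


def calling_pipeline(reference_fasta: str, bam_path: str) -> str:
    """Build (but do not execute) an mpileup | bcftools view pipeline string.

    Assumption: legacy SAMtools 0.1.x, where 'mpileup -u -f REF' writes
    uncompressed BCF to stdout and 'bcftools view ... -' reads it from stdin.
    """
    mpileup = ["samtools", "mpileup", *MPILEUP_OPTS, "-u", "-f", reference_fasta, bam_path]
    view = ["bcftools", "view", *BCFTOOLS_VIEW_OPTS, "-"]
    stages = [" ".join(shlex.quote(arg) for arg in stage) for stage in (mpileup, view)]
    return " | ".join(stages)


# Hypothetical file names, for illustration only.
command = calling_pipeline("mm9.fa", "fvb_nj.bam")
print(command)
```

Keeping the published option strings in named constants makes it easy to check a rerun against the parameters reported here.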
Variants and indels were filtered using 'vcf-annotate' from the VCFtools package. Filters and cutoff values are listed in Table s6 in Additional file 1. These filters are designed to identify inaccessible or uncallable sites and remove false SNP and indel calls due to alignment artifacts. Only homozygous SNPs and indels were retained.
To determine the effect of variants and indels on transcripts, we used the Ensembl Variant Effect Predictor tool version 2.2 against mouse gene models from Ensembl version 64. Grantham scores were also generated to predict the impact of the amino acid substitutions.
SNPs and indels that are unique to FVB/NJ were identified by realigning the sequence data for the 17 MGP strains using BWA, and performing recalibration and realignment around indels, as described above for the FVB/NJ data. The 17 MGP strains were then genotyped at all FVB/NJ SNP and indel sites using the SAMtools mpileup and BCFtools pipeline described above. To obtain a list of high confidence private FVB/NJ SNPs, we required each site to be genotyped as a homozygous reference allele in all 17 MGP strains, with a genotype quality of at least 30 and supporting read depth of 5 or more reads. The same criteria were applied to find private FVB/NJ indels.
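The private-variant rule described above can be expressed as a small predicate. This is a minimal sketch, assuming each MGP strain's call at an FVB/NJ variant site has already been reduced to a genotype label, a genotype quality and a read depth; the `StrainCall` type and its genotype labels are hypothetical and are not part of the published pipeline.

```python
from dataclasses import dataclass
from typing import Sequence

N_MGP_STRAINS = 17
MIN_GENOTYPE_QUALITY = 30   # threshold stated in the text
MIN_READ_DEPTH = 5          # threshold stated in the text


@dataclass(frozen=True)
class StrainCall:
    """One MGP strain's genotype at a single FVB/NJ variant site (hypothetical type)."""
    genotype: str   # "hom_ref", "hom_alt", "het" or "missing" (hypothetical labels)
    quality: int    # genotype quality
    depth: int      # supporting read depth


def is_confident_hom_ref(call: StrainCall) -> bool:
    """True if the call is homozygous reference and passes both thresholds."""
    return (
        call.genotype == "hom_ref"
        and call.quality >= MIN_GENOTYPE_QUALITY
        and call.depth >= MIN_READ_DEPTH
    )


def is_private_to_fvb(calls: Sequence[StrainCall]) -> bool:
    """A site is private only if all 17 strains are confident homozygous reference."""
    return len(calls) == N_MGP_STRAINS and all(is_confident_hom_ref(c) for c in calls)


good = [StrainCall("hom_ref", 40, 12)] * N_MGP_STRAINS
low_depth = good[:-1] + [StrainCall("hom_ref", 40, 3)]
print(is_private_to_fvb(good), is_private_to_fvb(low_depth))
```

Under this rule a site with a low-confidence call in any one strain is dropped instead of being counted as private, which keeps the list conservative.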
Genotyping was performed using the iPLEX™ Gold Assay (Sequenom® Inc.). Assays for all SNPs were designed using the eXTEND suite and MassARRAY Assay Design software version 3.1 (Sequenom® Inc.). Primers were designed from 100 bp of sequence flanking the SNP or indel of interest; FVB/NJ SNPs and indels in the flanking regions were masked, in addition to repetitive regions. Amplification was performed in a total volume of 5 μl containing approximately 10 ng genomic DNA, 100 nM of each PCR primer, 500 μM of each dNTP, 1.25× PCR buffer (Qiagen, Crawley, West Sussex, UK), 1.625 mM MgCl2 and 1 U HotStar Taq® (Qiagen). Reactions were heated to 94°C for 15 minutes followed by 45 cycles at 94°C for 20 s, 56°C for 30 s and 72°C for 1 minute, then a final extension at 72°C for 3 minutes. Unincorporated dNTPs were SAP digested prior to iPLEX™ Gold allele specific extension with mass-modified ddNTPs using an iPLEX Gold reagent kit (Sequenom® Inc.). SAP digestion and extension were performed according to the manufacturer's instructions with reaction extension primer concentrations adjusted to between 0.7 and 1.8 μM, dependent upon primer mass. Extension products were desalted and dispensed onto a SpectroCHIP using a MassARRAY Nanodispenser prior to MALDI-TOF analysis with a MassARRAY Analyzer Compact mass spectrometer. Genotypes were automatically assigned and manually confirmed using MassARRAY TyperAnalyzer software version 4.0 (Sequenom® Inc.).
The false positive rate for SNP and indel discovery in FVB/NJ was estimated by randomly selecting 150 SNPs and 100 indels for genotyping in FVB/NJ and C57BL/6J using the Sequenom MassARRAY iPLEX Gold Assay, as described above. A genotype call was made for 128/150 SNPs and 69/100 indels. The concordance rate was 98.4% (126/128) for SNPs and 89.9% (62/69) for indels, giving false positive rates of 1.6% and 10.1%, respectively. However, one of the two discordant calls was due to a heterozygous call from the iPLEX Gold Assay. As we expect SNPs in inbred mice to be homozygous, this is likely to be an erroneous genotype call. We excluded this site to calculate the lower boundary of our SNP false positive rate, 0.8% (1/127 discordant calls). We also selected 111 private FVB/NJ SNPs, of which 103 genotype calls were made. The private SNPs chosen for genotyping only included non-synonymous SNPs, SNPs that create premature stop codons, or SNPs affecting splice sites. The concordance rate was 94.2% (97/103). Three of the discordant calls were heterozygous calls by the iPLEX Gold Assay, and excluding these calls gives a concordance rate of 97% (97/100).
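The rates quoted above follow directly from the genotyping counts. The short sketch below recomputes them; the helper name `false_positive_percent` is ours, and all counts are taken from the paragraph above.

```python
def false_positive_percent(discordant: int, called: int) -> float:
    """False positive rate, in percent, among sites with a genotype call."""
    return 100.0 * discordant / called


# Counts from the text: (discordant calls, sites with a genotype call).
snp_fp = false_positive_percent(128 - 126, 128)          # all SNPs
indel_fp = false_positive_percent(69 - 62, 69)           # all indels
snp_fp_lower = false_positive_percent(1, 127)            # one heterozygous call excluded
private_fp_upper = false_positive_percent(103 - 97, 103) # private SNPs
private_fp_lower = false_positive_percent(100 - 97, 100) # three heterozygous calls excluded

for label, value in [
    ("SNP false positive rate", snp_fp),
    ("Indel false positive rate", indel_fp),
    ("SNP lower bound", snp_fp_lower),
    ("Private SNP upper bound", private_fp_upper),
    ("Private SNP lower bound", private_fp_lower),
]:
    print(f"{label}: {value:.1f}%")
```

The two private-SNP values reproduce the 3.0 to 5.8% range reported for private SNPs in the results.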
We compared our catalogue of FVB/NJ SNP calls to FVB/NJ genotypes from the Perlegen/NIEHS data set. There are 996,981 sites genotyped as homozygous non-reference SNPs in this data set, and of these, 91.7% (914,225) are present in our FVB/NJ catalogue. Of the remaining, 2.7% (26,869) fell into 'uncallable' sites (see Materials and methods above) and 5.6% (55,887) were not in our SNP catalogue. We randomly selected 100 SNP sites from the 5.6% discordant sites for SNP genotyping using the iPLEX Gold Assay. Of the 93 that produced genotype calls, 90 sites (96.8%) were genotyped as homozygous reference alleles. Assuming that this portion of the 5.6% discordant genotypes are homozygous reference bases, then our false negative rate at accessible sites in the Perlegen/NIEHS data set is less than 1%. Two of the three discordant genotypes were called by the iPLEX Gold Assay as heterozygous SNPs in the FVB/NJ sample.
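The "less than 1%" figure can be checked with a short calculation. This is one plausible reconstruction, not necessarily the authors' exact arithmetic: it assumes the 3 of 93 validated sites that were not homozygous reference represent the fraction of genuinely missed SNPs among the 55,887 absent sites, and it takes accessible sites to be all Perlegen/NIEHS sites minus the uncallable ones.

```python
# Counts from the text.
PERLEGEN_SITES = 996_981
IN_CATALOGUE = 914_225
UNCALLABLE = 26_869
NOT_IN_CATALOGUE = 55_887
VALIDATED = 93            # absent sites that produced a genotype call
VALIDATED_HOM_REF = 90    # of those, genotyped as homozygous reference

# The three categories should account for every Perlegen/NIEHS site.
assert IN_CATALOGUE + UNCALLABLE + NOT_IN_CATALOGUE == PERLEGEN_SITES

# Assumption: validated sites that are not homozygous reference are true misses.
missed_fraction = (VALIDATED - VALIDATED_HOM_REF) / VALIDATED
estimated_true_misses = NOT_IN_CATALOGUE * missed_fraction

# Assumption: accessible sites = all sites minus uncallable sites.
accessible_sites = PERLEGEN_SITES - UNCALLABLE
false_negative_percent = 100.0 * estimated_true_misses / accessible_sites

print(f"Estimated true misses: {estimated_true_misses:.0f}")
print(f"False negative rate at accessible sites: {false_negative_percent:.2f}%")
```

Under these assumptions the estimate comes out well below the 1% bound stated in the text.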
SVs were identified using SVMerge, in which we applied a combination of BreakDancer, CND, RetroSeq and SECluster (unpublished), followed by filtering and local sequence assembly for all deletions and insertions (from SECluster) to obtain exact breakpoints, all as described previously [15, 28]. For this analysis, SECluster was optimized to work with reads mapped with BWA by including mate pairs with one mate soft-clipped and read pairs with non-inward facing orientation for read clustering and insertion calling. RetroSeq was also updated to include soft-clipped reads in the breakpoint resolution step. The parameters for all SV callers are shown in Table s7 in Additional file 1. Tandem duplications, inversions, and more complex paired-end mapping patterns were also identified using in-house Perl scripts, as described in Yalcin et al. Mouse genome reference assembly gaps, centromere and telomere regions were obtained from the University of California Santa Cruz Table Browser. Structural variants overlapping these regions, plus a window on either side of 500 bp for assembly gaps and 20 kb for centromeres and telomeres, were excluded, as mapping artifacts in these regions cause false SV calls.
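The exclusion rule at the end of this paragraph amounts to an interval-overlap test against padded regions. The sketch below illustrates the idea with closed, 1-based intervals held in per-chromosome dictionaries; the data layout, function names and example coordinates are hypothetical, and only the 500 bp and 20 kb flank sizes come from the text.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # closed interval (start, end) on one chromosome

GAP_FLANK_BP = 500          # window around assembly gaps (from the text)
CEN_TEL_FLANK_BP = 20_000   # window around centromeres and telomeres (from the text)


def pad(region: Interval, flank: int) -> Interval:
    """Extend a region by `flank` bp on each side, not going below position 1."""
    return (max(1, region[0] - flank), region[1] + flank)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def keep_sv(
    chrom: str,
    sv: Interval,
    gaps: Dict[str, List[Interval]],
    cen_tel: Dict[str, List[Interval]],
) -> bool:
    """Return False if the SV overlaps any padded gap, centromere or telomere."""
    masked = [pad(r, GAP_FLANK_BP) for r in gaps.get(chrom, [])]
    masked += [pad(r, CEN_TEL_FLANK_BP) for r in cen_tel.get(chrom, [])]
    return not any(overlaps(sv, m) for m in masked)


# Made-up coordinates for illustration.
gaps = {"10": [(100_000, 101_000)]}
cen_tel = {"10": [(1, 3_000_000)]}
print(keep_sv("10", (101_400, 101_600), gaps, cen_tel))   # inside the 500 bp gap window
print(keep_sv("10", (3_500_000, 3_501_000), gaps, cen_tel))
```

A linear scan is enough for a sketch; a genome-wide run over thousands of calls would more likely use sorted intervals or an interval tree.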
For analysis of genes with radical amino acid substitutions, a list of 394 Ensembl gene IDs was submitted to the DAVID website [24, 25] (version 6.7) for functional annotation, and 368 mapped to genes in their database. For structural variants overlapping genes, a list of 415 Ensembl gene IDs was submitted to the DAVID website, and 331 mapped to genes in their database. The default EASE threshold score was used (0.1) and the minimum number of genes for a term was set to 2.
PCR primers to validate insertions were designed using Primer3 release 2.2.3 and an in-house Perl script. The optimal primer length was set at 20 bp and all other Primer3 defaults were used. FVB/NJ SNP sites and structural variants were masked. Primers were designed to have a product size between 200 and 1,000 bp, relative to the reference genome. Candidate primer pairs were checked for uniqueness in the mouse reference genome, and insertions with no unique primer pairs were excluded from validation.
PCR was performed on C57BL/6J and FVB/NJ genomic DNA (The Jackson Laboratory, Bar Harbor, Maine, USA) using either Thermo-Start Taq DNA Polymerase (Abgene, Epsom, UK) with an annealing temperature of 60°C and extension time of 30 s (for 35 cycles) or Platinum® Taq DNA Polymerase High Fidelity (Life Technologies Limited, Paisley, UK) with an annealing temperature of 60°C and extension time of 5 to 8 minutes (for 35 cycles), according to the manufacturers' instructions. The PCR products were run on 1 to 2% agarose gels containing ethidium bromide, and visualized using a UV transilluminator. The approximate sizes of the PCR products were calculated by running molecular weight markers (Hyperladder™ I; Bioline Reagents Limited, London, UK) on each gel.
We used the BioMart tool from Ensembl build 64 to retrieve genotypes for the FVB/NJ strain in the chromosome 10 Ath11 10a and 10b regions, from positions 1 to 7,300,000 bp and 20,100,000 to 21,900,000 bp. We queried the Ensembl Mus musculus Variation 64 database, which is generated from dbSNP Build 128. From this list we identified sites with non-reference alleles, and compared these variants to the set of FVB/NJ variants we generated as described above.
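The comparison described here reduces to restricting both variant sets to the two Ath11 sub-regions and taking a set difference. A minimal sketch follows, assuming variants are represented simply as chromosome 10 positions; the region coordinates are from the text, while the function names and example positions are made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Ath11 sub-regions on chromosome 10 (coordinates from the text, in bp).
ATH11_REGIONS: Dict[str, Tuple[int, int]] = {
    "10a": (1, 7_300_000),
    "10b": (20_100_000, 21_900_000),
}


def region_of(position: int) -> Optional[str]:
    """Return '10a' or '10b' if the position lies in that region, else None."""
    for name, (start, end) in ATH11_REGIONS.items():
        if start <= position <= end:
            return name
    return None


def novel_sites(sequenced: Set[int], database: Set[int]) -> Dict[str, Set[int]]:
    """Group sequenced variant positions absent from the database by region."""
    result: Dict[str, Set[int]] = {name: set() for name in ATH11_REGIONS}
    for position in sequenced - database:
        name = region_of(position)
        if name is not None:
            result[name].add(position)
    return result


# Made-up positions for illustration.
sequenced_positions = {4_800_000, 5_100_000, 21_000_000, 50_000_000}
database_positions = {5_100_000}
print(novel_sites(sequenced_positions, database_positions))
```

The region widths implied by these coordinates (7.3 Mb and 1.8 Mb) match the sizes given for 10a and 10b in the results.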
Database for Annotation, Visualisation and Integrated Discovery
Genome Analysis Toolkit
Grantham matrix score
Mouse Genome Project
National Institute of Environmental Health Sciences
National Institutes of Health
polymerase chain reaction
quantitative trait locus
single nucleotide polymorphism
We would like to acknowledge the efforts of the library and sequencing teams, core IT teams and the other members of the Vertebrate Resequencing Informatics group at the Wellcome Trust Sanger Institute. This work was supported by the Medical Research Council, UK and the Wellcome Trust. DJA is supported by Cancer Research-UK and the Wellcome Trust.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.