Computational comparison of two mouse draft genomes and the human golden path
© Xuan et al., licensee BioMed Central Ltd 2002
Received: 24 October 2002
Accepted: 28 November 2002
Published: 5 December 2002
The availability of both mouse and human draft genomes has marked the beginning of a new era of comparative mammalian genomics. The two available mouse genome assemblies, from the public mouse genome sequencing consortium and Celera Genomics, were obtained using different clone libraries and different assembly methods.
We present here a critical comparison of the two latest mouse genome assemblies. The utility of the combined genomes is further demonstrated by comparing them with the human 'golden path' and through a subsequent analysis of a resulting conserved sequence element (CSE) database, which allows us to identify over 6,000 potential novel genes and to derive independent estimates of the number of human protein-coding genes.
The Celera and public mouse assemblies differ in about 10% of the mouse genome. Each assembly has advantages over the other: Celera has higher accuracy in base-pairs and overall higher coverage of the genome; the public assembly, however, has higher sequence quality in some newly finished bacterial artifical chromosome clone (BAC) regions and the data are freely accessible. Perhaps most important, by combining both assemblies, we can get a better annotation of the human genome; in particular, we can obtain the most complete set of CSEs, one third of which are related to known genes and some others are related to other functional genomic regions. More than half the CSEs are of unknown function. From the CSEs, we estimate the total number of human protein-coding genes to be about 40,000. This searchable publicly available online CSEdb will expedite new discoveries through comparative genomics.
In May 2002, two new mouse genome assemblies were released. One was the second version of the mouse genome assembly from Celera Genomics, created by using both private and public sequence information (denoted Cel2 ), and the other was the third version of the assembly from the public Mouse Genome Sequencing Consortium (denoted MGSCv3 ). Both these draft mouse genomes were obtained using a whole-genome shotgun (WGS) strategy, but using different mouse strains and distinct sequence-assembly algorithms.
Assembled by direct overlapping sequence fragments, Cel2 has about 260,000 contigs with a total size of 2.51 × 109 base-pairs (2.51 gigabases (Gb)), whereas MGSCv3 has about 220,000 contigs and covers 2.475 Gb of the mouse genome. By incorporating pair-end sequence information, Cel2 and MGSCv3 contain around 47,000 scaffolds and around 43,000 supercontigs, respectively. After using physical map information, such as sequence-tagged sites (STSs), 20 mouse chromosomes were constructed together with an 'unassigned' chromosome called chromosome UA in Cel2 and chromosome Un in MGSCv3. The total sizes of these chromosomes in Cel2 and MGSCv3 are about 2.62 Gb and 2.59 Gb, respectively. The gaps in these chromosomes occupy 4.1% of genome in Cel2 and 4.5% in MGSCv3. The average size of the contigs, gaps and scaffolds/supercontigs of both the genome assemblies are comparable (data not shown).
Comparison of the two mouse assemblies
mRNA sequences can only be used to check the quality of assembly in the gene regions, and, for this, the accuracy can be determined only within exons. Therefore, we also used 39 newly finished mouse bacterial artificial chromosomes (BACs) with known chromosome locations (data from 14 May to 23 May 2002 in NCBI daily updates) to test the coverage and accuracy of long continuous regions. Although all BACs matched at the correct chromosomal locations in both assemblies, MGSCv3 exhibited higher matching quality in these BAC regions. In MGSCv3, only two BACs had less than 90% coverage: AC087780 (74% in chromosome 1) and AC099773 (84% in chromosome 5) while four BACs in Cel2 had less than 90% coverage: AC090479 (11% in chromosome 18), AC023789 (85% in chromosome 4), AC021063 (86% in chromosome 18), and AC087115 (89% in chromosome 8). For the BAC AC090479, which has only 11% coverage in Cel2, many of the matching genomic DNA fragments are in the UA chromosome at present. This is probably due to the fact that the BAC sequencing information that Cel2 used was released before July 2001. Therefore, MGSCv3 could have a better assembly in those regions where newer BAC information had been incorporated. Interestingly, these two assemblies show some complementary features. For example, BAC AC090479 was only 11% covered in Cel2 but 99% covered by chromosome 18 in MGSCv3, whereas BAC AC087780 had 74% coverage in MGSCv3 but 94% coverage in chromosome 1 in Cel2.
As well as using different assembly methods, Cel2 and MGSCv3 also used different mouse strains. This may also cause some slight differences when comparing mRNAs or BACs from different strains with Cel2 and MGSCv3.
From the above analyses of mouse mRNAs and BACs, we find some features that are complementary between MGSCv3 and Cel2. This complementarity partly originates from the fact that MGSCv3 used more BAC sequencing information, which produced higher sequence quality and coverage in the regions covered by sequenced BACs. However, the mRNA test indicates that the sequences in Cel2 are more accurate than sequences in MGSCv3 in gene regions. Thus, it is highly recommended that the two mouse genome assemblies be used in an integrated fashion rather than separately.
Comparison between the human genome golden path and the two mouse assemblies
The human genome project is well into the finishing stage , providing the opportunity to compare the mouse and human genomes. Comparing the human genome with the mouse genome can greatly help our understanding of both genomes. We used the BLASTN program  to compare the December 2001 golden path freeze of the human genome, which is also NCBI build 28, with each mouse assembly. We first used RepeatMasker (A. Smit, unpublished work, and ), to mask the repeats in all the three genomes. With a fixed expect value (E-value) of 1.0e-1 we found 1,860,560 conserved sequence elements (CSEs) between the human and MGSCv3 genome assemblies and 1,737,297 CSEs between the human and Cel2. Each CSE includes one human genome segment and one matched mouse genome segment. The simplest cases, which we call 'univalent CSEs', are matches between unique human and mouse regions. However, there are some cases, which we call 'multivalent CSEs', where more than one mouse region matches the same human region, or conversely, more than one human region matches a mouse region. For all multivalent CSEs, we chose the longest matches, and added them to univalent CSEs to make a new set, called 'primary CSE set'.
Statistics of CSE numbers and base-pairs covered
Number of CSEs
Number of human primary CSEs
Number of mouse primary CSEs
Human genome covered by primary CSEs (bp)
Mouse genome covered by primary CSEs (bp)
Number of human-mouse univalent CSEs
Human mouse univalent CSEs only found in one mouse assembly, based on the human location
MGSCv3 + Cel2
Number of human primary CSEs
Human genome covered by primary CSEs (bps)
Cross-species sequence conservation information is widely used in current gene-prediction tools, such as TwinScan , SGP-1  and SLAM (L. Pachter, personal communication). Lack of conservation information may decrease the prediction accuracy of these methods. The inconsistency of CSE locations in the human genome from the two different mouse assemblies suggests that, for reliability, it is advisable to use both assemblies to search genes and to perform other functional analyses.
Functional analysis of CSEs
Sequence conservation is usually related to functional regions of the genome, such as those comprising protein-coding genes, RNA genes and promoter regions. Using more complete conservation information, it may be possible to discover additional functional regions of the genome. In this section, we describe some salient features of the human CSEs.
CSEs in the human RefSeq genes
We used the human RefSeq gene annotation from golden path December 2001 database to explore further the possible functional implications of these CSEs. The locations of CSEs in the human genome were compared with a total of 14,653 RefSeq gene structures. We found that 94.9% of these genes and 81% of all exons (139,694) are covered by the CSEs in aCSE-hp (data not shown). These CSEs also cover 57.2% of the base-pairs in the exon regions. Coding regions (CDS) have a higher degree of CSE coverage (77.7% at the nucleotide level) than 5' untranslated regions (5' UTRs) (24%) and 3' UTRs (18%). In addition, 35.3% of the CSEs are located within RefSeq gene regions. Of these, 19.3% are exon-related, 14.8% are intron-related, and 1.2% are 'alternative exon'-related (that is, the location of the CSE in the human genome is in an intron region according to one mRNA transcript, but in an exon region according to another mRNA transcript). Here, the relationship between a CSE and an exon or an intron is defined by the location of the middle point of the CSE. Clearly, most of the RefSeq genes have at least one CSE hit.
CSEs in the known and predicted members of ETS-domain protein family
ETS-domain genes' location and related CSEs
Human genome location
CSEs in intronic regions of known genes
To investigate the possible function of CSEs in intronic regions, we examined a subset of intronic CSEs. The ten longest intronic CSEs (see Supplementary Table 1 in the additional data file) within known human genes, including RefSeq genes and other genes with known mRNA sequences, were further examined and compared with the human golden path annotation . Our analysis indicates that many long intron-related CSEs may actually be alternative exons. For example, we found that nine of these ten CSEs have spliced or unspliced human expressed sequence tag (EST) matches. Some of them also shared similarity with mouse mRNAs. There was only one CSE for which we could not get any functional information (CSE ID is chr1c_0024330, it is located at human chr1:246028930-246030945 in an intron region of a novel gene with mRNA accession number AL122093.)
CSEs, genes and transcriptional activity in human chromosome 22
CSEs between known genes
Many long CSEs that are located in the intergenic regions between the known human genes were also scrutinized against ESTs and genes in other species. Although two of the five longest intergenic region CSEs were determined to be pseudogenes, two others appeared to be gene-related. CSE chr9p_0053719 is located in the 17707984-17709816 region of human chromosome 9. In the corresponding mouse region, transcript mCT3199 was predicted in the Celera database and a zinc-finger protein basonuclin-like gene (XM_143875) was also annotated by NCBI. CSE chr6p_0056662 (human location: chr6: 51140014-51141656) also overlaps one annotated novel gene, dJ402H5.2 in NCBI. Of these five CSEs, only CSE chr6p_0077382 does not have any annotation information. However, we found a genomic region in the rat genome that matches the human location (chr6:105053471-105055327) with 94% identity, which is not much lower than 95% within this CSE. Even in the well-finished and annotated human chromosomes, such as 20 and 22, some long intergenic region CSEs can still be found. Some of them are pseudogenes, while others appear to be related to novel genes. For example, in the human genomic region around CSE chr20p_0001494, a novel gene is predicted by GenomeScan  in NCBI (XM_104356), but is not otherwise confirmed. In the whole human genome, we found 6,259 NCBI-annotated novel human gene models that are covered by at least one CSE. Although we do not know the clear functions for these novel gene-related CSEs, they obviously deserve immediate experimental verification.
CSEs in promoter regions
It is probable that many CSEs located in promoter regions contain cis-regulatory elements. For example, transcription factor 8 (TCF8) can repress interleukin 2 (IL2) expression by binding to a negative regulatory element 100 bp upstream of the IL2 transcription start site . We found that this promoter region is hit by our CSE chr4p_0043880 (its human location is chr4: 124039778-124040406). In a separate study, we are systematically investigating how to incorporate CSE information into promoter analysis.
CSEs in RNA genes
We found that 439 CSEs are related to non-translated RNA genes (from golden path April 2001 human genome annotation), such as ribosomal RNA, microRNA, small nucleolar RNA (snoRNA), and the like. For example, Z30 small nucleolar RNA  (accession number: AJ007733), located at human chr17: 34803858-34803954, is a methylation guide molecule for U6 snRNA. The whole genomic DNA region of Z30 is hit by our CSE chr17p_0010358 (human location: chr17: 34803845-34803959). By checking the corresponding mouse genome location of this CSE, we discovered the corresponding mouse Z30 gene (accession number: AJ007734). Therefore, genomic sequence conservation between different species can effectively facilitate the discovery of RNA genes.
CSEs and human genomic segment duplication
Although many CSEs are related to different functional regions in the genome, more than half of all CSEs are still mysterious. Experimental approaches and further theoretical characterizations are needed to discover the function of these conserved elements. All the CSEs and their available functional information are accessible and searchable in our CSEdb, which may be accessed through the genome browser link at .
Human gene number estimation by human-mouse comparison
Materials and methods
BLAT was used to match all mouse mRNA sequences and BAC sequences to both mouse assemblies with the default parameter setting. As the chromosome information is known for each BAC, we only calculated the coverage of BAC by the corresponding chromosome in order to check the accuracy of the assembly.
The BLAST comparison of the human golden path with Cel2 and MGSCv3 mouse genome assemblies were finished on an 80-CPU LINUX cluster in 4 days. NCBI BLAST was used with different E-values as the threshold (1.0e-10, 1.0e-4, and 1.0e-1) in this project. We found that 1.0e-1 appeared to be the best parameter for covering both gene-related and non-coding CSEs, because the average length of exons is about 150 bp and that of regulatory elements is much shorter. Another reason to choose 1.0e-1 is that we can use only those more highly significant CSEs if needed. The CSE number is super-exponentially decreased when increasing the significance from 1.0e-1 to 1.0e-10. The other options of BLAST use -b 50000 -v 50000 - U T.
Density of base-pairs in CSEs
The density of base-pairs in CSEs or Sanger Center annotated exons is calculated by counting the number of base-pairs in CSEs or exons within every 5.7-Mb window. This 5.7-Mb window is slided 57 kb in each step. Oligonucleotide array data was downloaded from the University of California Santa Cruz golden path server. The probe whose score is higher than 80 in at least one cell line is regarded as the positive probe. The following 35 base-pairs of the positive probes are counted to calculate the base-pair density in a 57-kb window .
To test whether a human region in a CSE encoded a pseudogene, we used the human genomic DNA region of this CSE to search the whole human genome with BLAT. If this continuous human region contained internal splice sites like a cDNA, we regarded this region as a potential pseudogene location.
Gene number estimation
Estimation of the number of human protein-coding genes and of human-mouse homologs is based on two assumptions. As the only training data at present are three finished human chromosomes, (chromosomes 20, 21 and 22), the first assumption is that the percentage of the gene-related CSEs in three finished chromosomes is approximately the same as the percentage in the whole genome. The second assumption is that the average CSE number per gene calculated in the three finished chromosomes is approximately the same as in the whole genome. Under these assumptions, we could estimate the total number of human-mouse homologous genes (nGENE hm ) from the total number of CSEs in a CSE-hp (nCSE a-hp ) set, the number of human primary CSEs in three finished chromosomes (nCSE 3chr-hp ) and the total genes in these three chromosomes covered by CSEs (nGENE c ):
where nGENE 3chr is the total number of annotated genes in chromosomes 20, 21 and 22, which is equal to 1,595 (the overlapping genes were only counted once and non-coding genes are not counted) from the present data. Of these, 1,291 are covered by CSEs and the total number of human primary CSEs within these three chromosomes is 25,578. Thus, we obtain the total number of human-mouse homologous genes as 590,632 × (1,291/25,578) = 29,811 at E-values less than 1.0e-1, and the total number of human genes is therefore estimated as 590,632 × (1,595/25,578) = 36,830.
We also tried to use information from these three chromosomes to estimate gene number separately, and got the mean number 35,322 with a standard deviation of 9,705.
Additional data files
Supplementary tables listing the 10 longest intron-region CSEs and 451 mouse genomic segments and their matched human segment pairs, and a figure showing the correlation between portions of CSEs and chromosome length of one chromosome in the whole genome are available as additional data file with the online version of this paper.
We thank Celera Genomics for producing and providing the access to the Cel2 data, and the Public Mouse Consortium for producing and NCBI for providing the access to the MGSCv3 data. We thank Jim Kent for providing BLAT, and L. Stein, G. Hannon, M. Wigler and E. Lander for critical reading of the manuscript. We thank Michelle Carmell for helping to proofread. Rackable kindly provided us with a powerful Linux cluster and Vsevolod (Simon) Ilyushchenko gave superb system support. We also thank members of the Zhang lab for numerous discussions. This work was supported by NIH grants.
- Celera Genomics. [http://www.celera.com]
- Ensembl mouse genome server. [http://mouse.ensembl.org]
- Kent WJ: BLAT-The BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.View ArticleGoogle Scholar
- NCBI RefSeq database website. [http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html]
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.View ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.View ArticleGoogle Scholar
- Repeatmasker. [http://repeatmasker.genome.washington.edu]
- Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, 17: S140-S148.View ArticleGoogle Scholar
- Wiehe T, Gebauer-Jung S, Mitchell-Olds T, Guigo R: SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res. 2001, 11: 1574-1583. 10.1101/gr.177401.View ArticleGoogle Scholar
- Xuan ZY, McCombie WR, Zhang MQ: GFScan: a gene family search tool at genomic DNA sequence level. Genome Res. 2002, 12: 1142-1149. 10.1101/gr.220102. Article published online before print in June 2002.View ArticleGoogle Scholar
- Ensembl. [http://www.ensembl.org]
- UCSC genome bioinformatics. [http://genome.ucsc.edu]
- Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR: Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002, 296: 916-919. 10.1126/science.1068597.View ArticleGoogle Scholar
- Yeh RF, Lim LP, Burge CB: Computational inference of homologous gene structures in the human genome. Genome Res. 2001, 11: 803-816. 10.1101/gr.175701.View ArticleGoogle Scholar
- Williams TM, Moolten D, Burlein J, Romano J, Bhaerman R, Godillot A, Mellon M, Rauscher FJ, Kant JA: Identification of a zinc finger protein that inhibits IL-2 gene expression. Science. 1991, 254: 1791-1794.View ArticleGoogle Scholar
- Zhou H, Chen YQ, Du YP, Qu LH: The Schizosaccharomyces pombe mgU6-47 gene is required for 2'-O-methylation of U6 snRNA at A41. Nucleic Acids Res. 2002, 30: 894-902. 10.1093/nar/30.4.894.View ArticleGoogle Scholar
- M Zhang lab ftp server. [ftp://cshl.org/pub/science/mzhanglab]
- Human versus mouse conserved sequence elements (Version 2002). [http://gene.cshl.org/cgi-bin/gbrowse?source=cse]
- M Zhang lab ftp server: novel gene CSE. [ftp://cshl.org/pub/science/mzhanglab/NOVELgeneCSE]
- M Zhang lab ftp server: DuplicationSeg. [ftp://cshl.org/pub/science/mzhanglab/DuplicationSeg]
- Salamov A, Solovyev V: Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000, 10: 516-522. 10.1101/gr.10.4.516.View ArticleGoogle Scholar