- Deposited research article
- Open Access
Assembling and gap filling of unordered genome sequences through gene checking
Genome Biology volume 2, Article number: preprint0008.1 (2001)
The first draft of the human genome sequence is complete. A large number of DNA sequences are already available in the databases, but they are not ordered or assembled. In many cases these are short sequences (10 kb to 100 kb) separated by runs of "NNNNNN". A considerable number of gaps also remain to be filled in the coming years. Even once the raw data have been generated, producing properly ordered, finished sequence is an enormous task that is expected to take another 2 years.
Here, we describe a simple way to order random genome sequences and to trace gaps, which could then be filled by subsequent hybridization and sequencing. The method has three steps. 1) Selection of large cDNAs from the database (from lower organisms to human). 2) BLAST searches with these large cDNAs against the unordered human genomic sequences (raw BAC DNA sequences or large DNA fragments). 3) Ordering of these BAC sequences or large DNA fragments by their homology with the cDNA sequences, so that the continuity of the exonic sequences is maintained. When a sequence from an organism other than human is used for the search, homologous exons can still be taken into account on the basis of evolutionary conservation. Any discontinuity in the exonic sequences denotes a possible gap between two BACs or two sequences.
In this way a large number of BACs could be arranged. Gaps could then be traced and filled by further hybridization and sequencing.
Human genome sequencing is in its second phase. The two smallest chromosomes, 21 and 22, and the first draft of the whole human genome sequence are already finished [1,2,3]. Whole-genome sequencing of mouse and rat is in progress, and sequencing of the genomes of a number of other organisms (other primates, dog, cat, etc.) is expected to be initiated. A large number of human genome sequences are already available in the databases, but they are not ordered or assembled. In many cases these are short sequences (10 kb to 100 kb) separated by runs of "NNNNNN". From this stage of the sequencing effort, complete, mature finished sequences, ideally without any gaps, are expected. Two principal methods of genome sequencing are currently employed: systematic sequencing of chromosomally integrated BACs, and shotgun sequencing followed by assembly of the sequences in the right order into the chromosome. In the systematic approach, all BACs should ideally be ordered and integrated. Given the enormity of the task, however, a large number of BACs are neither ordered nor overlapping, so gaps remain in the finished sequences. In the shotgun method, ordering and assembling the sequences is the only concern. Even after the raw data have been generated by both methods (that is, after completion of the first draft), properly ordered, finished sequences are expected to take another 2 years. Ordering non-overlapping BAC sequences and assembling them into a whole chromosome is a major concern. A "BAC selection parking strategy" was suggested to minimize cost and to order BACs after sequencing; it relies on a minimal 10% sequence overlap at the BAC ends and subsequent walking. Moreover, tracing the extent of the gaps and filling them is important for integrating these sequences into a chromosome.
We propose a gene-hunting-based method for ordering these BAC sequences and for tracing and filling the gaps, so that the sequences can be properly integrated into chromosomes.
This could be achieved in three simple steps. 1) Selection of large cDNAs from the database (from lower organisms to human). There is no shortage of long cDNAs: more than 83000 complete cDNAs from all organisms are already in GenBank, and more are accumulating. A large number of cDNAs longer than 5 kb are in the database (the KIAA group alone represents more than 100 cDNAs over 4 kb). 2) BLAST searches with these large cDNAs against the unordered raw genomic sequences of human (or of any other organism). These searches reveal homology with the corresponding exonic sequences. 3) Because the searches are done with a long cDNA or a complete gene, two or more DNA sequences (coming from BACs) can be ordered on the basis of the continuity of the exonic sequences. A few examples of proper ordering are shown in table 1. The corresponding cDNAs were searched against the human "high throughput genome sequences" (htgs). BAC sequences are ordered on the basis of the evolutionary conservation of the exonic sequences.
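The ordering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build table 1: the `Hit` record, the function names `order_bacs` and `exonic_gaps`, the BAC identifiers and the coordinates are all hypothetical, and the BLAST alignments are assumed to have already been parsed into 0-based, half-open cDNA intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment between the query cDNA and a genomic sequence.

    Coordinates are positions on the cDNA, 0-based and half-open
    (an assumption of this sketch; BLAST itself reports 1-based positions).
    """
    bac: str         # hypothetical identifier of the BAC or DNA fragment
    cdna_start: int  # first cDNA position covered by the alignment
    cdna_end: int    # one past the last cDNA position covered


def order_bacs(hits):
    """Order BACs by the cDNA position of their first aligned exon."""
    first_pos = {}
    for h in hits:
        first_pos[h.bac] = min(h.cdna_start, first_pos.get(h.bac, h.cdna_start))
    return sorted(first_pos, key=first_pos.get)


def exonic_gaps(hits, cdna_length):
    """Return the cDNA intervals (start, end) that no hit covers."""
    gaps = []
    covered_to = 0  # cDNA is covered from position 0 up to here
    for h in sorted(hits, key=lambda h: h.cdna_start):
        if h.cdna_start > covered_to:
            gaps.append((covered_to, h.cdna_start))
        covered_to = max(covered_to, h.cdna_end)
    if covered_to < cdna_length:
        gaps.append((covered_to, cdna_length))
    return gaps


# Invented example: two BACs carry consecutive exons of a 1500 bp cDNA.
hits = [
    Hit("BAC_B", 720, 1500),
    Hit("BAC_A", 0, 300),
    Hit("BAC_A", 300, 650),
]
bac_order = order_bacs(hits)
gaps = exonic_gaps(hits, cdna_length=1500)
print(bac_order)
print(gaps)
```

In this invented example the uncovered stretch of cDNA falls between the last exon found on the first BAC and the first exon found on the second, which is the kind of discontinuity the method reads as a possible gap between two sequences.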
The most important information, the position and extent of each gap, can be obtained from the ordered BACs. In the ordering above (table 1), the exonic gaps (missing exons, or parts of exons, that do not fit the continuous cDNA sequence; denoted by * in table 1) in KIAA0535 (650-720), RATMIBP1 (368-615) and RalGPS1A (484-565, 875-1015) are 70 bp, 247 bp, 81 bp and 140 bp, respectively. Even if a large intron lies between two consecutive BACs in these cases, the intron is unlikely to be larger than 50 kb. The unknown BACs could be traced from a BAC library by hybridization with the corresponding partial cDNA, and further sequencing of the selected BACs could fill the gaps.
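The gap sizes quoted above follow directly from the cDNA coordinates. The short check below takes the size of each gap as end minus start, which is the convention that reproduces the figures in the text; the dictionary layout and the function name `gap_sizes` are illustrative choices, not part of the original analysis.

```python
# Exonic gap intervals on each cDNA, copied from the text (cDNA coordinates).
gap_intervals = {
    "KIAA0535": [(650, 720)],
    "RATMIBP1": [(368, 615)],
    "RalGPS1A": [(484, 565), (875, 1015)],
}


def gap_sizes(intervals):
    """Size of each gap in bp, taken as end minus start."""
    return [end - start for start, end in intervals]


sizes = {gene: gap_sizes(iv) for gene, iv in gap_intervals.items()}
for gene, bp in sizes.items():
    print(gene, bp)
# KIAA0535 [70]
# RATMIBP1 [247]
# RalGPS1A [81, 140]
```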
These BLAST searches can also help to order BAC sequences carrying homologous genes. When used as a single query cDNA, the mouse lrd (left-right dynein) gene arranges BAC sequences carrying two different homologous dynein heavy chain genes on chromosome 17 (table 2). Although the identity/similarity falls as homology with lrd decreases, this should not interfere with deducing exonic continuity and then ordering consecutive BACs that carry homologous genes.
These BLAST searches were done against the available genomic DNA (BAC sequences) deposited from all chromosomes (htgs). If the searches were restricted to BAC sequences from a single chromosome or part of a chromosome, which is what most genome centers work on, the results should be robust. Over 30000 genes have already been assigned to chromosomes by radiation hybrid mapping, and assignments of more genes are accumulating rapidly. A large number of complete cDNAs have been characterized, and more are being characterized. These mapped genes are immensely helpful for chromosome-specific BLAST searches. The shorter sequences (less than 100 kb) that arise initially from sequencing could be ordered in this way and finally assembled by simple BLAST or by "BAC end Power Blast".
Not all unordered BACs can be ordered in this way, since the method depends on continuous or overlapping exonic sequences. Nevertheless, it is an efficient way to order a large number of gene-carrying BAC sequences. Chromosome 21 and chromosome 22 contain 10.5 Mb and 3 Mb of geneless sequence [1,2]. Even if one third of the human genome were geneless (which is not actually the case), the gene-carrying regions would span 2 billion of the 3 billion bp and lie within approximately 10000 to 15000 BACs (average BAC size 150 kb to 200 kb). BLAST searches with large cDNAs would also be useful when difficulties or ambiguities arise in arranging two or more BAC sequences of unknown location. Apart from human, genome sequencing has been initiated for many organisms and draft sequences are being generated. A hybrid of systematic sequencing and shotgun sequencing has recently been considered for sequencing other genomes. Complete or partial genes (long cDNAs, but not ESTs) could be useful for ordering two sequences or tracing gaps. This is a simple way to integrate BAC sequences into chromosomes, or to order random short sequences, and so to obtain an accurate, properly finished sequence of each chromosome.
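The BAC estimate given above can be checked with simple arithmetic. The sketch assumes a non-overlapping tiling of the gene-carrying 2 billion bp, which is a simplification: real BAC tiling paths overlap, so the true number would be somewhat higher. The names `GENOME_BP` and `bacs_needed` are illustrative.

```python
GENOME_BP = 3_000_000_000            # approximate human genome size used in the text
GENE_CARRYING_BP = GENOME_BP * 2 // 3  # two thirds, if one third were geneless


def bacs_needed(region_bp, bac_bp):
    """Minimum number of non-overlapping BACs covering region_bp (ceiling division)."""
    return -(-region_bp // bac_bp)


low = bacs_needed(GENE_CARRYING_BP, 200_000)   # with 200 kb BACs
high = bacs_needed(GENE_CARRYING_BP, 150_000)  # with 150 kb BACs
print(low, high)  # 10000 13334
```

Without overlap the count is roughly 10000 to 13000, which is in line with the approximate range of 10000 to 15000 BACs stated in the text.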
Dunham I, et al: The DNA sequence of human chromosome 22. Nature. 1999, 402: 489-495. 10.1038/990031.
Hattori M, et al: The DNA sequence of human chromosome 21. Nature. 2000, 405: 311-319. 10.1038/35012518.
Lander E, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
Marshall E: A strategy for sequencing the genome 5 years early. Science. 1995, 267: 783-784.
Fleischmann RD, et al: Whole genome random sequencing and assembly of Haemophilus influenzae. Science. 1995, 269: 496-511.
Pennisi E: Finally the book of life and instructions for navigating it. Science. 2000, 288: 2304-2307. 10.1126/science.288.5475.2304.
Kuehl PM, Weisemann JM, Touchman JW, Green ED, Boguski MS: An effective approach for analyzing "prefinished" genomic sequence data. Genome Res. 1999, 9: 189-194.
Roach JC, Siegel AF, van den Engh G, Trask B, Hood L: Gaps in the Human Genome Project. Nature. 1999, 401: 843-845. 10.1038/44684.
Nagase T, et al: Prediction of the coding sequences of unidentified human genes. XX. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro. DNA Res. 1997, 4: 141-150.
Deloukas P, et al: A physical map of 30,000 human genes. Science. 1998, 282: 744-746. 10.1126/science.282.5389.744.
We thank Prof. Stylianos Antonarakis for reading this manuscript and for his critical comments.