- Open Access
Bases and spaces: resources on the web for accessing the draft human genome
Genome Biology volume 1, Article number: reviews2001.1 (2000)
Much is expected of the draft human genome sequence, and yet there is no central resource to host the plethora of sequence and mapping information available. Consequently, finding the most useful and reliable human genome data and resources currently available on the web can be challenging, but is not impossible.
Nice press release, shame about the data
The entire sequence of the human genome is not expected to be finished for some time, and gaps are expected to persist into 2003 [1]. In the meantime, the genome exists in 'draft' form: multiple segments of sequence in which we have high confidence, placed relative to one another by mapping information of lower confidence. Many biologists study particular regions of the genome, such as those involved in positional cloning of disease genes, and this type of work is greatly accelerated by having most of the sequence of the region of interest. The draft human genome now includes this information for most of the genome. Unfortunately, no single resource unites the available human genomic sequences with their locations and their gene content, but by combining the varied resources currently available it is possible to devise strategies that fully exploit the draft genome data. So what resources and information are available so far and where can we find them? Note that the databases and resources mentioned in this article and the corresponding URLs are listed in Table 1.
Raw sequence data
As recently as 1996, the entire GenBank (Table 1) database contained around 0.65 Gb of DNA sequence; but the draft human genome sequence alone runs to more than 3.08 Gb. Most of the draft sequence is present in GenBank (Table 1) as unfinished, fragmentary BAC (bacterial artificial chromosome) sequences. These consist of a number of non-overlapping, arbitrarily ordered fragments, or 'contigs', which have been artificially concatenated to produce a single sequence entry for each BAC. Typically, each contig within a BAC is separated from the next by a long run of undetermined bases, written as 'N'. All unfinished BAC entries are subject to irregular updates until they are finished, and this might alter the number and size of the contigs they contain. The most straightforward web interface for retrieving BAC sequence (and various other types of data) is Entrez (Table 1) at the National Center for Biotechnology Information (NCBI), which also includes substantial online documentation.
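Because an unfinished BAC entry is a concatenation of contigs joined by runs of 'N', the individual contigs can be recovered by splitting on those runs. The following is a minimal sketch only: the function name `split_contigs` and the `min_gap` threshold are assumptions made for illustration, and the toy sequence is not real data.

```python
import re

def split_contigs(bac_sequence, min_gap=10):
    """Split a concatenated BAC entry into contigs at runs of 'N'.

    min_gap is an assumed threshold: shorter runs of N are treated as
    ambiguous bases inside a contig rather than as contig separators.
    Returns a list of (start_offset, contig_sequence) tuples.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = []
    start = 0
    for match in gap.finditer(bac_sequence):
        if match.start() > start:
            contigs.append((start, bac_sequence[start:match.start()]))
        start = match.end()
    if start < len(bac_sequence):
        contigs.append((start, bac_sequence[start:]))
    return contigs

# Toy sequence (not real data): three fragments joined by runs of N.
toy = "ACGTACGT" + "N" * 20 + "GGCCTTAA" + "N" * 15 + "TTTT"
for offset, contig in split_contigs(toy):
    print(offset, len(contig))
```

Keeping the start offset of each contig makes it possible to relate a BLAST match position in the concatenated entry back to the contig it falls in.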
Most BAC sequence entries contain information about the BAC in the 'DEFINITION' field near the top of the Entrez display. For example, the DEFINITION field in the Entrez entry AP001002 (Table 1) contains the BAC clone name (678K21) and the cytogenetic band to which it has been localized (11q14). Many entries give much less annotation; for example, at present, Entrez entry AC007104 (Table 1) provides no clone name, nor even the clone library, and gives the location as simply 'chromosome 4'. I will discuss ways to find a more specific location for this clone below, but we can retrieve the clone name using a little-known feature of Entrez. Under the 'Display' pull-down menu simply select 'ASN.1' (which is a sequence format used internally at NCBI) and redisplay the entry. The sequence data are now unreadable, but near the top of the file is the clone name '301J10'. The same information is retrievable from the 'XML' and 'Graphics' display formats, but not under the default GenBank format.
A related site, the Human BAC Ends (Table 1) site at The Institute for Genomic Research (TIGR), provides access to more than 743,000 end sequences from 470,000 BAC clones. (Typically, end sequences consist of several hundred base pairs from the clone ends.) It is possible to search the sequences with either a clone name or a sequence of interest. As the unfinished BAC sequences in GenBank do not always contain the sequences from the BAC ends, the BAC end sequences may provide extra sequence data for a clone of interest. In addition, the end sequences can help to identify the fragments of unfinished BAC sequences that represent the ends of the clone. One caveat is that, as usual, the annotation of these sequences should be treated with a certain degree of caution, because clone ends have been known to be attributed to the wrong clone [2]. Conveniently, the BAC end sequences at TIGR are provided with any repetitive sub-sequences masked (they are replaced with runs of the letter X). Repetitive sequences are a recurring problem in dealing with genomic sequence, particularly interspersed repeats (regions of very similar sequence descended from various classes of transposable elements) [3]. Interspersed repeats often span hundreds or thousands of bases and so can appear as spurious overlaps between genomic sequence fragments. The excellent program RepeatMasker (Table 1) does a good job of masking both interspersed and simple repeats. Simple repeats are stretches of sequence made up of units consisting of one or more bases, which may be repeated hundreds of times. They can be used as genetic markers, for example, in disease association studies, so finding them and annotating them properly is an important task. The Sputnik (Table 1) program provides a fast and elegant method for annotating simple repeats, giving each repeat's location, classification (on the basis of repeat unit length - dinucleotides, trinucleotides and so on) and sequence.
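To make the idea of simple-repeat annotation concrete, the sketch below reports exact tandem repeats by unit length. It does not reproduce Sputnik's own algorithm; it is an exact-match regular-expression approach chosen for brevity, and the function name, parameter defaults and toy sequence are all assumptions.

```python
import re

def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit (e.g. 'CACA')."""
    n = len(unit)
    return not any(
        n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n)
    )

def find_simple_repeats(sequence, min_unit=2, max_unit=5, min_copies=4):
    """Report exact tandem repeats as (start, end, unit, copies) tuples.

    Coordinates are 0-based, end-exclusive. Only perfect repeats are found;
    a real annotation tool would also tolerate mismatches.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One unit followed by at least (min_copies - 1) further copies.
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1)
        )
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            if not is_primitive(unit):
                continue  # already reported under the shorter unit length
            copies = (match.end() - match.start()) // unit_len
            hits.append((match.start(), match.end(), unit, copies))
    return sorted(hits)

# Toy sequence (not real data): a dinucleotide and a trinucleotide repeat.
toy_repeats = "GGTC" + "CA" * 6 + "TTG" + "GAT" * 5 + "CC"
for start, end, unit, copies in find_simple_repeats(toy_repeats):
    print(start, end, unit, copies, "%d-nucleotide repeat" % len(unit))
```

Classifying by unit length, as in the printed output, mirrors the dinucleotide/trinucleotide classification described above.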
Ideally, it would be desirable to retrieve the genomic sequence of a region of interest defined by the user, rather than multiple segments restricted to the size of BAC clones. A heroic, preliminary assembly of the draft genome sequence is available on the Working Draft Sequence (Table 1) site from David Haussler's group at the University of California, Santa Cruz (UCSC). Although this assembly contains over 200,000 gaps as well as some misassemblies and incorrectly ordered sequences, as it is updated with more sequence data it will become an important resource. The information is incorporated into the Entrez Homo sapiens genome view (Table 1) at NCBI, which is a graphical viewer designed to integrate sequence data with mapping information from various sources. Again, this NCBI interface will be a potent tool when more sequence data are available; it is already the best integration of data for finished chromosomes such as 21 and 22.
Expressed sequence data
Before the flood of genomic sequence from the Human Genome Project, full sequences were available for only a small proportion of human genes. Most human genes were represented only by expressed sequence tags (ESTs; fragments of mRNA sequences). Various efforts have been made to cluster overlapping EST sequences to give a longer representative sequence for each gene [4]. The most comprehensive of these efforts is the human UniGene (Table 1) database at the NCBI, in which ESTs and mRNAs from GenBank (Table 1) that share overlapping subsequences have been grouped together into clusters. UniGene can be searched either with UniGene (Table 1) cluster accession numbers or with GenBank (Table 1) sequence accession numbers for ESTs or mRNAs. Clusters are linked to related mapping, sequence and expression data at the NCBI, and each cluster should represent a separate gene. As UniGene (Table 1) is automatically and regularly generated, it often contains errors. One serious problem is chimeric clusters, produced as a consequence of sequencing from chimeric clones (artifactual cDNAs that contain sequences from two different genes). TIGR also maintains a clustered EST database, called the Human Gene Index (HGI, Table 1), which has more stringent clustering criteria than UniGene (Table 1). Another human expressed sequence database, called STACK (Table 1), is held at the South African National Bioinformatics Institute (SANBI). In STACK (Table 1), expressed sequences are separated with respect to tissue of origin before clustering, and an attempt is made to represent differently spliced transcripts of the same gene. Unlike UniGene (Table 1), the STACK (Table 1), HGI (Table 1) and EuroGeneIndexes (Table 1) sites produce consensus sequences for clusters. Transcribed sequence databases are also available for species other than human; UniGene (Table 1) holds data for mouse, rat and zebrafish and there are TIGR Gene Indices (Table 1) for various other species.
The EuroGeneIndexes (Table 1) at the European Bioinformatics Institute (EBI) also contain expressed sequence clusters for a number of non-human species. It is worth remembering that all expressed sequence databases will contain repetitive sequences because much of the sequence is from untranslated regions of genes.
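The grouping of overlapping ESTs and mRNAs into clusters can be pictured as finding connected components in a graph whose edges are pairwise overlaps. The sketch below uses a union-find structure for this; it is an illustration of the general idea only, not the procedure used by UniGene, HGI or STACK, and all sequence identifiers are hypothetical.

```python
def cluster_by_overlap(sequence_ids, overlaps):
    """Group sequences into clusters given pairwise overlaps.

    sequence_ids: iterable of identifiers (hypothetical names here).
    overlaps: iterable of (id_a, id_b) pairs judged to overlap.
    Returns a sorted list of sorted clusters.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in parent:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())

ids = ["est1", "est2", "est3", "est4", "mrnaA"]
pairs = [("est1", "est2"), ("est2", "mrnaA"), ("est3", "est4")]
print(cluster_by_overlap(ids, pairs))
# [['est1', 'est2', 'mrnaA'], ['est3', 'est4']]

# A single chimeric sequence overlapping both groups merges them into one
# cluster, which is how chimeric clusters arise.
print(cluster_by_overlap(ids, pairs + [("est2", "est3")]))
```

The second call shows why more stringent clustering criteria matter: one spurious overlap is enough to fuse two otherwise separate genes.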
Mapping data
The fragmentary human genome sequence is of little use without some idea of how the pieces fit together, so a map is needed that relates distinct landmarks, or sequence tagged sites (STSs), around the genome. Different types of mapping data provide maps of different resolutions. Genetic maps, based on the frequency of recombination events between STSs, are of relatively poor resolution - on the order of hundreds, or more often thousands, of kilobases. Physical mapping techniques can resolve STSs only tens of kilobases apart. In the early stages of the Human Genome Project, an important task was to construct a high-resolution genetic map of the genome, and the Genome Database (GDB, Table 1) was set up to curate such data. Genetic mapping data allowed genomic regions to be broadly defined, and efforts proceeded to physical mapping for finer distinctions. Various physical mapping projects have confirmed the physical order of genetic maps and extended genome maps to include further STSs and transcribed sequences. A physical map of the genome based on overlapping yeast artificial chromosome (YAC) contigs was among the first to be published and the data are available from CEPH-Genethon (Table 1). One of the most important physical mapping techniques to emerge has been radiation hybrid (RH) mapping [5]. RH maps are orderings of STSs based on assay scores of the STSs against a whole genome radiation hybrid 'panel'. Such panels consist of hybrid cell lines that contain different fragments of human genomic DNA. Each STS is assayed against each cell line to discover whether it is present in the genomic fragments particular to that cell line. The pattern of presence or absence in the cell lines making up a panel constitutes the retention pattern of the STS, and, by comparing STS retention patterns, the distance between STSs can be estimated.
In this way, the TNG4 radiation hybrid map (Table 1) was generated at the Stanford Human Genome Center (SHGC) and provides an average of 60 kb resolution across the genome. Such impressive estimates of resolution must be tempered, however, by the ambiguity that often accompanies RH-derived marker ordering. Comparisons of STS orders in sequenced regions of the genome with orders derived from RH maps suggest that RH map orders may be wrong up to 50% of the time [6]. A consortium of RH mapping centres has produced a transcript map of the genome based on RH mapping data, named GeneMap'99 (Table 1), which is accessible at the NCBI. The STS content of a sequence of interest can be determined online using the electronic-PCR (e-PCR, Table 1) program at NCBI. This is a rapid sequence-search algorithm that searches your sequence for occurrences of the STS sequences in GenBank (Table 1).
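The comparison of retention patterns that underlies RH mapping can be illustrated with a very crude measure: the fraction of cell lines in which exactly one of two STSs is retained. STSs that lie close together are rarely separated by a radiation-induced break, so their patterns differ in few cell lines. This is only a sketch; real RH map construction uses statistical models rather than this raw fraction, and the encoding ('1' retained, '0' absent, '?' untyped) and the toy patterns are assumptions.

```python
def discordance(pattern_a, pattern_b):
    """Fraction of informative cell lines in which the two STSs differ.

    Patterns are strings over '1' (retained), '0' (absent) and '?'
    (untyped; an assumed encoding). Cell lines with '?' for either STS
    are ignored.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same panel")
    informative = [
        (a, b) for a, b in zip(pattern_a, pattern_b)
        if a in "01" and b in "01"
    ]
    if not informative:
        raise ValueError("no cell line typed for both STSs")
    differing = sum(1 for a, b in informative if a != b)
    return differing / len(informative)

# Toy retention patterns across a ten-line panel (not real data).
sts_a = "1100101000"
sts_b = "1100100000"   # differs from sts_a in one cell line
sts_c = "0011010110"   # differs from sts_a in nine cell lines

print(discordance(sts_a, sts_b))  # 0.1
print(discordance(sts_a, sts_c))  # 0.9
```

On this toy panel, sts_a and sts_b would be judged much closer together than sts_a and sts_c; with so few cell lines the estimate is very noisy, which hints at why RH-derived orders can be ambiguous.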
An important new source of mapping data has become available with the release of the draft genome: the fingerprint analysis of BAC clones for the genome project at Washington University Genome Sequencing Center (WUGSC). The human genome BAC map (Table 1) provides the highest resolution human mapping data yet made available and is likely to do so until publication of the full human genome. The overlaps between clones are calculated using the program FPC (Table 1) on the basis of clone restriction fragment patterns or fingerprints. The resulting contigs are estimated to cover 97-98% of the genome. The fingerprint analysis has also been extended to show the sequence accession numbers for those clones that have been sequenced, forming a Human Accession Map (Table 1).
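The principle of fingerprint-based overlap detection is that two clones sharing a stretch of genome will share restriction fragments of similar size. The sketch below simply counts fragments that match within a tolerance; it does not reproduce FPC's scoring, and the function name, the tolerance value, the units and the fragment sizes are all hypothetical.

```python
def shared_bands(fragments_a, fragments_b, tolerance=10):
    """Count restriction fragments shared by two clones.

    Fragment sizes are numbers in arbitrary but consistent units. Two
    fragments are considered the same band if their sizes differ by no
    more than `tolerance`. Each fragment is matched at most once.
    """
    a = sorted(fragments_a)
    b = sorted(fragments_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared

# Toy fingerprints (not real data).
clone_1 = [1200, 2500, 3100, 4800, 7600]
clone_2 = [1203, 2495, 5200, 7610, 9000]
print(shared_bands(clone_1, clone_2))  # 3
```

A raw count like this ignores the chance that unrelated clones share bands by coincidence, which is why a dedicated program is needed to decide whether an apparent overlap is significant.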
Genome sequence annotation
Once a region of the genome has been sequenced, the immediate concern is to identify the genes, if any, that are present. Broadly speaking, the computational annotation of genomic sequence proceeds by two methods: ab initio gene prediction, and detection of similarity. Strictly defined, ab initio prediction of genes relies on the presence of compositional biases in genomic sequence that are characteristic of exons. Similarity to known transcribed or protein sequences can be used as further evidence of the accuracy of an ab initio prediction. Many gene prediction programs combine these types of evidence and show considerable success in detecting genes [7,8]. Computational predictions must be treated with caution, however, before they have been confirmed at the bench.
The Ensembl (Table 1) database aims to provide a basic level of computational annotation for the draft genome. It localises BAC sequences in the genome according to a combination of mapping data and runs the sequences themselves through an 'analysis pipeline'. This pipeline consists of repeat masking the sequence, processing it with a gene prediction program called Genscan and then searching the predicted genes against sequence databases. Predicted genes that match known genes become Ensembl (Table 1) genes and are stored in the searchable Ensembl database. The Genome Channel (Table 1) is an analogous pipeline system that gives more detailed annotation, including CpG islands (regions of DNA with a relatively high G+C content and an elevated frequency of CpG dinucleotides), poly-adenylation sites and gene predictions from more than one gene prediction program. With rather more effort, it is possible to get very detailed annotation for a genomic sequence of interest through the NIX (Table 1) interface at the Human Genome Mapping Project Resource Centre (HGMP). Sequences submitted to NIX (Table 1) are processed by a variety of programs that detect repetitive regions, exons, tRNA genes, promoters, CpG islands, poly-adenylation sites and similarity to known proteins or transcribed sequences. The NIX interface is only available to registered HGMP users but it is possible for academic scientists to register without charge.
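As an example of the kind of compositional feature these pipelines annotate, the sketch below computes the two quantities usually used to recognise a CpG island in a window of sequence: G+C content and the ratio of observed to expected CpG dinucleotides. The thresholds (at least 200 bp, G+C of at least 50%, observed/expected of at least 0.6) are a commonly used rule of thumb and are an assumption here; the pipelines named above may apply different criteria.

```python
def cpg_stats(window):
    """Return (gc_fraction, observed/expected CpG ratio) for a sequence."""
    window = window.upper()
    n = len(window)
    if n == 0:
        return 0.0, 0.0
    c = window.count("C")
    g = window.count("G")
    observed = window.count("CG")
    # Expected CpG count if C and G were distributed independently.
    expected = c * g / n
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio

def looks_like_cpg_island(window, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed rule-of-thumb thresholds to one window."""
    gc_fraction, ratio = cpg_stats(window)
    return len(window) >= min_length and gc_fraction >= min_gc and ratio >= min_ratio

# Toy windows (not real data).
cpg_rich = "CCGG" * 50      # G+C rich, CpG at the expected frequency
cpg_depleted = "CAGCTG" * 40  # G+C rich but contains no CpG at all

print(cpg_stats(cpg_rich), looks_like_cpg_island(cpg_rich))
print(cpg_stats(cpg_depleted), looks_like_cpg_island(cpg_depleted))
```

The second window shows why G+C content alone is not sufficient: it is two-thirds G+C yet has no CpG dinucleotides, so it fails the test.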
Putting the data to work
Although I am a computational biologist, most of my work involves collaboration with molecular biologists generating real data at the bench. I find that, after a hard day in their labs, people very rarely ask me to discuss the available draft genome resources. Their problems are invariably specific to a small number of genes or genomic regions. Where in the genome is gene X? What is in genomic region Y? These are the commonest questions, and I suggest generic approaches to answering them below. Even in the best-case scenario, however, where gene X is well characterized and already mapped, there will often be additional information to be extracted. What are the neighbouring genes and what is their relative order and orientation? What non-coding features (regulatory elements, pseudogenes and repetitive regions) lie in the vicinity?
Where in the genome is gene X?
As the draft sequence is estimated to cover more than 90% of the genome, the chances of finding part or all of gene X in unfinished BAC sequence are high. If the available sequence of gene X contains any non-coding DNA, it should first be masked using RepeatMasker (Table 1). A BLAST (Table 1) search of the sequence of gene X against the section of the database that contains the draft sequence is all that is necessary to find the relevant BACs. Using the NCBI Advanced BLAST (Table 1) site it is possible to limit the search to human draft genome sequence by selecting the 'htgs' database and 'Homo sapiens' in 'Advanced options' (Figure 1). Assuming the sequence quality is good for gene X, the BLAST (Table 1) output should show at least one segment of BAC sequence that is almost identical (greater than or equal to 98% identical is a reasonable rule of thumb) over a reasonable stretch of gene X. In the absence of any good match to a BAC sequence, the best option is to BLAST (Table 1) search gene X against human EST sequences (the 'human ests' database at NCBI) and search UniGene (Table 1) with matching EST accession numbers, because many UniGene (Table 1) clusters contain mapped ESTs. If gene X is found within a BAC sequence, the BAC should be repeat masked and submitted to e-PCR (Table 1) at NCBI, which will often provide one or more STSs. These STSs may be localized to a genetic or RH map using their accession numbers to search either the GDB (Table 1) or the Stanford RH maps. If the BAC that contains gene X does not contain any STSs, it can be used, after masking repeats, to search the htgs database again to discover overlapping BACs. Again, the intention is to find identical sequences, allowing for sequencing errors, and a reasonable rule-of-thumb measure of 'identical' is a stretch of greater than 1 kb showing greater than or equal to 98% identity in the BLAST output. 
Overlapping BACs may be annotated as coming from the same chromosome as gene X or the first BAC, and they can be submitted to e-PCR (Table 1) and assigned a location. Supporting evidence for a collection of overlapping BACs can be obtained from the Human Accession Map (Table 1) and the Working Draft Sequence (Table 1) site at UCSC (Figure 2). If these two resources include the BACs but do not show them to overlap, one should be suspicious.
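The rule of thumb used above for calling two sequences 'identical' (a stretch of more than 1 kb at 98% identity or better) is easy to apply automatically once BLAST results are in tabular form. The sketch below assumes the 12-column tab-separated layout produced by present-day command-line BLAST (query, subject, percent identity, alignment length, and so on), which is not the web output described in this article; the sequence names and values are hypothetical.

```python
def near_identical_hits(tabular_lines, min_identity=98.0, min_length=1000):
    """Keep hits with identity >= min_identity over more than min_length bases.

    Each line is assumed to be tab-separated with the subject name in
    column 2, percent identity in column 3 and alignment length in
    column 4. Comment lines starting with '#' are skipped.
    """
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length > min_length:
            hits.append((subject, identity, length))
    return hits

# Toy results (not real data); BAC_A, BAC_B and BAC_C are made-up names.
toy_blast = [
    "# hypothetical search of gene X against draft BAC sequence",
    "geneX\tBAC_A\t99.20\t2450\t18\t2\t1\t2450\t10501\t12950\t0.0\t4300",
    "geneX\tBAC_B\t98.00\t1000\t20\t0\t1\t1000\t501\t1500\t0.0\t1700",
    "geneX\tBAC_C\t91.50\t3000\t250\t5\t1\t3000\t1\t3000\t0.0\t3900",
]
print(near_identical_hits(toy_blast))  # [('BAC_A', 99.2, 2450)]
```

In the toy data, BAC_B is rejected because its match is exactly 1 kb rather than longer, and BAC_C because its identity is too low to be explained by sequencing errors alone.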
What is in genomic region Y?
Determining what is in genomic region Y involves a process similar to mapping gene X. The more sequence we start with from region Y the better. In the worst case, we might only have one STS sequence that is known to be from region Y but is not part of any known transcript. As in the section 'Where in the genome is gene X', after repeat-masking the STS we can use it to search the htgs database using NCBI Advanced BLAST (Table 1) for matching genomic sequence. This first stage is the most troublesome; as STSs are only a few hundred bases long it is desirable to have some corroborating evidence to back up any apparent match to a BAC. This can come in the form of mapping data. Other markers or genes near the STS on genetic or RH maps would be expected to appear in the sequence of the apparently matching BAC or in BACs that overlap with it. Once we have reliably placed the starting STS sequence in a BAC, the task is to build up a contig of overlapping BACs around it, as in the section 'Where in the genome is gene X'. The Human Accession Map (Table 1) at WUGSC and Working Draft Sequence (Table 1) site at UCSC can be used as guides to choosing overlapping BACs that extend your contig furthest. Many BAC sequences, particularly those in earlier stages of sequencing, contain cloning vector sequences that can generate spurious BLAST (Table 1) matches. It is possible to use RepeatMasker (Table 1) to also mask vector sequences but this is not an option offered on the RepeatMasker (Table 1) web server. If a BAC sequence generates a large number of BLAST (Table 1) matches then the sequence should be searched against the entire sequence database ('nr' at NCBI BLAST - Table 1) to look for the presence of bacterial sequence. It is important to remember that the word contig is actually rather inappropriate here, because most BAC sequences are fragmented and incomplete.
Generally BAC sequences are 150-200 kb long when complete, so it is possible to estimate roughly the amount of missing sequence. This process should eventually result in a list of BAC accession numbers that represents most of the sequence in region Y. The accession numbers that result can be used to search Ensembl (Table 1) to give the minimum number and identities of genes in region Y. More detailed analysis, including the identification of non-coding sequence features, can be carried out using NIX (Table 1). This strategy is probably only practical for investigating modestly sized regions of, say, less than 1 Mb; for larger regions, for example, a chromosomal band, it is easier to approach the process as an automated task - ask your friendly local bioinformaticist.
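The rough estimate of missing sequence mentioned above is simple arithmetic: subtract the total length of the contigs in an unfinished BAC entry from the 150-200 kb expected for a complete clone. A minimal sketch follows; the function name and the toy contig lengths are assumptions.

```python
def estimate_missing(contig_lengths, expected_range=(150_000, 200_000)):
    """Return (low, high) estimates of missing bases for one BAC.

    contig_lengths: lengths in bases of the contigs in the unfinished entry.
    expected_range: assumed size range of a complete BAC (150-200 kb).
    Estimates are floored at zero.
    """
    sequenced = sum(contig_lengths)
    low, high = expected_range
    return max(0, low - sequenced), max(0, high - sequenced)

# Toy contig lengths (not real data) totalling 101 kb.
toy_contigs = [42_000, 31_000, 18_500, 9_500]
print(estimate_missing(toy_contigs))  # (49000, 99000)
```

For this toy entry, somewhere between about 49 kb and 99 kb of the clone would still be unsequenced, so genes counted from it should be treated as a minimum.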
The web resources described here are those I find most useful, and are a 'snapshot' as of September 2000. They will, of course, be subject to change or updates as more information becomes available. Other people will have their own opinions and methods on how to use web resources for accessing the draft human genome - there is no substitute for experience.
References
1. Roach JC, Siegel AF, van den Engh G, Trask B, Hood L: Gaps in the human genome project. Nature. 1999, 401: 843-845.
2. Zhao S, Malek J, Mahairas G, Fu L, Nierman W, Venter JC, Adams MD: Human BAC ends quality assessment and sequence analyses. Genomics. 2000, 63: 321-332. 10.1006/geno.1999.6082.
3. Smit AF: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999, 9: 657-663. 10.1016/S0959-437X(99)00031-3.
4. Bouck J, Yu W, Gibbs R, Worley K: Comparison of gene indexing databases. Trends Genet. 1999, 15: 159-162. 10.1016/S0168-9525(99)01709-6.
5. Radiation hybrid mapping information. [http://compgen.rutgers.edu/rhmap/]
6. Agarwala R, Applegate DL, Maglott D, Schuler GD, Schäffer AA: A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 2000, 10: 350-364. 10.1101/gr.10.3.350.
7. Burge CB, Karlin S: Finding the genes in genomic DNA. Curr Opin Struct Biol. 1998, 8: 346-354. 10.1016/S0959-440X(98)80069-9.
8. Semple C: Gene prediction: the end of the beginning. Genome Biology. 2000, 1: reports4012.1-4012.3. 10.1186/gb-2000-1-2-reports4012. [http://genomebiology.com/2000/1/2/reports/4012]