The approach to this task depends on the particular region of interest. If the region is of modest size (say a megabase or less), you could begin by looking for an NCBI finished or unfinished sequence contig [5] that includes it. If you have sequence from the region, use the human genome BLAST search at the NCBI [7]; if not, the NMV browser [22] can be used to identify the correct contig according to its STS and known-gene content. If you have any doubts about an unfinished contig, examine the BAC sequence accession numbers used to create it and check their positions according to the IHGMC physical map [6,10]. Using the NMV browser you can view known genes and SNPs, but richer annotation is accessible using the UCSC HGB [4] or Ensembl [16] browsers. It is straightforward to identify the corresponding HGB and Ensembl region by searching with the names or accession numbers of genes or markers within the NCBI contig. By collating the annotations from HGB and Ensembl you will have the most comprehensive publicly available view of the region. Typically, most interest is given to the protein-coding potential of the region.
Where a gene prediction is based on a full-length cDNA clone, there should be little room for error: prediction is reduced to the production of an alignment between the cDNA and genomic sequences. Even in these cases, however, it is possible to end up with an incomplete gene structure if the underlying genomic sequence assembly is gapped or misassembled. It may be helpful to compare the gene structures given by NMV [22] and the other two browsers [4,16], since the NCBI and UCSC genome assemblies presumably contain different gaps and misassemblies. Less reliable gene predictions are those based on an ab initio prediction (for example by Genscan [17] in Ensembl [16]) supported by similarity to transcribed sequence. Unfortunately, although ab initio prediction programs are good at detecting the presence of a given gene, they do rather badly at accurately predicting its structure - real exons can be missed and spurious ones can be added. The UCSC HGB [4] offers a potential way to resolve problems with spurious additional exons in the form of the output from a second ab initio prediction program (Fgenesh). Given that different prediction algorithms produce different falsely predicted exons, you can improve the accuracy of a predicted gene by comparing the results of two algorithms and keeping only those exons predicted by both. In HGB [4] the output of the Fgenesh gene-prediction algorithm can be compared with Ensembl-predicted genes (usually based upon Genscan [17], as discussed above). An additional problem is that prediction algorithms can artificially fragment a single gene into two or more smaller genes. In Ensembl [16] this appears as two or more neighboring genes that all share high similarity to the same sequences, according to the Ensembl 'supporting evidence' information for the genes.
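The exon-filtering idea described above (keep only those exons predicted by both programs) can be sketched in a few lines of Python. This is an illustrative sketch, not part of either browser: the function name, the coordinate lists and the `min_overlap` threshold are all assumptions, and the exon coordinates are invented and would in practice be copied from the Genscan-based and Fgenesh tracks.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def consensus_exons(pred_a: List[Exon], pred_b: List[Exon],
                    min_overlap: float = 0.8) -> List[Exon]:
    """Return the exons of pred_a that are matched by some exon of pred_b.

    Two exons are treated as the same prediction when their overlap covers
    at least `min_overlap` of the longer exon (an assumed, tunable threshold).
    """
    kept = []
    for a_start, a_end in pred_a:
        for b_start, b_end in pred_b:
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            longer = max(a_end - a_start, b_end - b_start) + 1
            if overlap > 0 and overlap / longer >= min_overlap:
                kept.append((a_start, a_end))
                break  # one supporting exon is enough
    return kept


# Invented coordinates, for illustration only.
genscan = [(100, 250), (400, 520), (900, 1010), (1500, 1620)]
fgenesh = [(100, 250), (405, 520), (1500, 1620), (2000, 2100)]

print(consensus_exons(genscan, fgenesh))
# [(100, 250), (400, 520), (1500, 1620)]
```

Requiring agreement between programs removes exons predicted by only one of them, at the cost of also discarding any real exon that one program happened to miss, so the result is best treated as a conservative core structure.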
Automatic annotation systems such as those presented in HGB [4] and Ensembl [16] can also have difficulty differentiating between a functional gene and a recent pseudogene. A recent pseudogene may be sufficiently intact to generate a convincing score within an ab initio algorithm and, by definition, will have strong similarity to at least one real gene in the database. Such pseudogenes may carry premature stop codons and appear as truncated versions of real genes - users of HGB [4] and Ensembl [16] beware. Both HGB and Ensembl give an indication of whether a predicted gene appears to be spliced or not by aligning mRNA or EST sequences included in the gene to genomic sequence. The presence of splicing is often a good indication that a predicted gene is functional. Once you have identified a predicted gene that appears convincing, it is a good idea to examine the sequences to which it is similar. Unfinished human genomic sequence can contain contaminating sequence from other organisms: for instance, if a predicted gene resembles a bacterial gene more closely than any vertebrate gene, it is almost certainly within contaminated genomic sequence.
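The contamination check in the last sentence above is a simple comparison of best hits, and can be sketched as follows. This is a minimal sketch under stated assumptions: the `Hit` record, the group labels "bacteria" and "vertebrate", and the use of a similarity score where higher means more similar are all hypothetical conventions, and the hits themselves would come from whatever similarity search you ran with the predicted gene.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    subject: str   # name of the matched database sequence (invented here)
    group: str     # taxonomic group label, e.g. "bacteria" or "vertebrate"
    score: float   # similarity score; higher means more similar


def best_score(hits: List[Hit], group: str) -> Optional[float]:
    """Highest score among hits from the given group, or None if there are none."""
    scores = [h.score for h in hits if h.group == group]
    return max(scores) if scores else None


def looks_contaminated(hits: List[Hit]) -> bool:
    """True if the best bacterial hit is stronger than every vertebrate hit."""
    bacterial = best_score(hits, "bacteria")
    vertebrate = best_score(hits, "vertebrate")
    if bacterial is None:
        return False
    return vertebrate is None or bacterial > vertebrate


# Invented hits, for illustration only.
hits = [
    Hit("bacterial_gene_X", "bacteria", 410.0),
    Hit("mouse_gene_Y", "vertebrate", 95.0),
]
print(looks_contaminated(hits))  # True
```

A flag from a check like this is only a prompt to inspect the underlying clone; it does not by itself prove contamination.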
At the moment, the best indication of the presence of promoters or other non-coding regulatory elements is conservation of discrete islands of non-coding sequence between human and mouse sequences. Both Ensembl [16] and HGB [4] have incorporated such data as they emerge from the mouse-sequencing project, but only Ensembl provides downloadable mouse sequence data. At the moment, only HGB offers comparisons with a second organism: the pufferfish Tetraodon nigroviridis. Both Ensembl and the NCBI intend to make the annotated genomes of other organisms available via their browsers. In the meantime, one can browse genomic regions of mouse-human homology at a low resolution using the NCBI Human-Mouse Homology Map [24]. Alternatively, a more detailed view of the pattern of homology over a region can be obtained using the ingenious PipMaker program [25]. PipMaker aligns two sequences of DNA up to 2 Mb in length and produces a 'percent identity plot' (hence 'pip') showing, at a glance, the pattern of conservation over the region.
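To make the idea of a percent identity plot concrete, the sketch below computes the values such a plot would display from a pairwise alignment that has already been made: the position, length and percent identity of each gap-free segment. It is not PipMaker's implementation and performs no alignment itself; the function name and the toy aligned strings are assumptions for illustration.

```python
from typing import List, Tuple


def percent_identity_segments(aln_a: str, aln_b: str) -> List[Tuple[int, int, float]]:
    """Summarise each gap-free run of a pairwise alignment.

    aln_a and aln_b are the two aligned sequences, with '-' marking gaps.
    Returns (start in sequence A, 0-based; segment length; percent identity).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    segments = []
    pos_a = 0                       # position in ungapped sequence A
    seg_start, seg_len, matches = 0, 0, 0

    def flush() -> None:
        # Record the current segment, if any.
        if seg_len > 0:
            segments.append((seg_start, seg_len, 100.0 * matches / seg_len))

    for base_a, base_b in zip(aln_a, aln_b):
        if base_a == "-" or base_b == "-":
            flush()                 # a gap ends the current segment
            seg_len, matches = 0, 0
        else:
            if seg_len == 0:
                seg_start = pos_a
            seg_len += 1
            if base_a.upper() == base_b.upper():
                matches += 1
        if base_a != "-":
            pos_a += 1
    flush()
    return segments


# Toy alignment, for illustration only.
for start, length, pid in percent_identity_segments("ACGTAC--GTTA",
                                                    "ACGAACGGGT-A"):
    print(start, length, round(pid, 1))
```

Plotting percent identity against the start position in the first sequence gives the at-a-glance view of conservation described above; islands of high identity outside known exons are the candidates for regulatory elements.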
All three of the annotation browsers display SNPs within a region of interest, but Ensembl and HGB also show the positions of repetitive sequence, so it is possible to avoid choosing SNPs that lie within repeats when designing PCR-based assays. Unfortunately, there is no classification of SNPs on this basis, so the user has to plough through all the SNPs, checking that each one does not overlap with the coordinates given for repetitive sequences. It is also worth remembering that the number of known SNPs is increasing so rapidly that more may have been deposited in dbSNP [20] and/or on The SNP Consortium (TSC) site [26] since the version of the annotation you are browsing was produced. It may therefore be worth searching dbSNP and TSC data with BAC or gene-sequence accession numbers from your region.
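Checking each SNP against the repeat coordinates by hand is tedious, and if you have copied the SNP positions and repeat intervals out of a browser the comparison is easily scripted. The sketch below is one way to do it; the function name, the SNP identifiers, the coordinates and the optional `flank` margin (to keep primers clear of repeats as well) are all assumptions for illustration, and both lists are assumed to use the same coordinate system.

```python
from typing import List, Tuple

Snp = Tuple[str, int]      # (identifier, position)
Repeat = Tuple[int, int]   # (start, end), inclusive


def snps_outside_repeats(snps: List[Snp], repeats: List[Repeat],
                         flank: int = 0) -> List[Snp]:
    """Return SNPs that do not fall within any repeat interval.

    `flank` widens every repeat by that many bases on each side, so that
    SNPs lying very close to a repeat are also excluded.
    """
    kept = []
    for snp_id, pos in snps:
        in_repeat = any(start - flank <= pos <= end + flank
                        for start, end in repeats)
        if not in_repeat:
            kept.append((snp_id, pos))
    return kept


# Invented identifiers and coordinates, for illustration only.
snps = [("snp_A", 1200), ("snp_B", 5050), ("snp_C", 9990)]
repeats = [(1000, 1300), (9000, 9400)]

print(snps_outside_repeats(snps, repeats))
# [('snp_B', 5050), ('snp_C', 9990)]
print(snps_outside_repeats(snps, repeats, flank=600))
# [('snp_B', 5050)]
```

The pairwise scan is quadratic in the worst case, which is adequate for a region of a megabase or so; sorting both lists first would be the natural refinement for larger regions.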