- Open Access
Reinvestigation of the Saccharomyces cerevisiae genome annotation by comparison to the genome of a related fungus: Ashbya gossypii
Genome Biology volume 4, Article number: R45 (2003)
The recently sequenced genome of the filamentous fungus Ashbya gossypii revealed remarkable similarities to that of the budding yeast Saccharomyces cerevisiae both at the level of homology and synteny (conservation of gene order). Thus, it became possible to reinvestigate the S. cerevisiae genome in the syntenic regions leading to an improved annotation.
We have identified 23 novel S. cerevisiae open reading frames (ORFs) as syntenic homologs of A. gossypii genes; for all but one, homologs are present in other eukaryotes including humans. Other comparisons identified 13 overlooked introns and suggested 69 potential sequence corrections resulting in ORF extensions or ORF fusions with improved homology to the syntenic A. gossypii homologs. Of the proposed corrections, 25 were tested and confirmed by resequencing. In addition, homologs of nearly 1,000 S. cerevisiae ORFs, presently annotated as hypothetical, were found in A. gossypii at syntenic positions and can therefore be considered as authentic genes. Finally, we suggest that over 400 S. cerevisiae ORFs that overlap other ORFs in S. cerevisiae and for which no homolog can be detected in A. gossypii should be regarded as spurious.
Although the S. cerevisiae genome is rightly considered one of the most accurately sequenced and annotated eukaryotic genomes, we have shown that it still benefits substantially from comparison to the completed sequence and syntenic gene map of A. gossypii, an evolutionarily related fungus. This type of approach will strongly support the annotation of more complex genomes such as the human and murine genomes.
A major breakthrough in the field of genomics came with the publication of the 13 Mb genome of the budding yeast Saccharomyces cerevisiae, which was the first eukaryotic genome to be fully sequenced and annotated. Since then, DNA sequencing has developed with increasing speed, and sequences of much larger genomes, such as those of Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Homo sapiens [5, 6], Anopheles gambiae and Mus musculus, have been published. However, increased sequencing capacity was not matched by a corresponding development in annotation, and the gene annotation process is now the rate-limiting step in whole-genome sequencing projects. Despite progress in gene prediction programs, comparison to expressed sequence tag (EST) databases and to genomic sequences, preferably of related organisms, is still the most favored approach to the annotation of complex genomes.
The original annotation of the S. cerevisiae genome was especially challenging because, at the time of its completion, only limited genomic sequence information from other eukaryotes was available. Despite the functional characterization of a large number of orphan open reading frames (ORFs) and several efforts to re-evaluate the sequence at the gene level or for an entire chromosome [10, 11], a significant number of uncertainties remain. It is, for example, not known whether all protein-coding genes have been identified and which of the close to 2,000 genes annotated as hypothetical represent real genes. A careful comparison to a related genome should help clarify several of these issues.
The recently completed genome sequence of the filamentous ascomycete Ashbya gossypii revealed an unexpectedly high degree of gene homology and gene order conservation with S. cerevisiae (F.S.D., S.V., S.B., A.L., K.G., C. Mohr, S. Steiner, P. Luedi, T.G. and P.P., unpublished work). The two species diverged more than 100 million years ago, and both genomes differ substantially in their GC content (38.3% in S. cerevisiae and 51.9% in A. gossypii). However, 95% of the 4,700 A. gossypii protein-coding genes were found to have a homolog in S. cerevisiae and 90% of these homologous genes map at syntenic positions. Despite these striking genomic similarities, the average conservation at the DNA level is 55% in coding regions but drops to 33% in noncoding regions. Thus, significant sequence similarities are restricted to coding regions. Altogether, these findings open up the possibility of a whole-genome reinvestigation of the S. cerevisiae annotation.
We carried out an extensive search for homology at the amino-acid level between A. gossypii coding regions and S. cerevisiae 'annotation-free' regions: stretches of sequence bearing no annotated genomic features such as ORFs, RNA genes, or transposable elements. Focusing on syntenic regions, we identified a total of 95 inconsistencies, suggesting the following four types of changes in the S. cerevisiae annotation: novel genes, novel introns, potential ORF extensions, and neighboring ORF fusions. Furthermore, we provide evidence that information from the complete A. gossypii genome is also a major resource for recognizing real genes among the numerous S. cerevisiae hypothetical ORFs.
Results and discussion
We searched for homology at the amino-acid level between annotated A. gossypii coding regions and S. cerevisiae 'annotation-free regions'. As a result, we identified 95 regions in the S. cerevisiae genome, which had not been annotated as protein coding, that showed both homology and synteny to A. gossypii genomic sequences. In this context, synteny refers to a relaxed synteny (loose synteny), which results from several hundred genomic rearrangements in the A. gossypii and S. cerevisiae lineages and from frequent loss of one of the two gene copies (twin genes) in S. cerevisiae after the proposed doubling of the genome [12, 13]. As a result, all remaining duplicated genes in S. cerevisiae have a single homolog in A. gossypii (F.S.D., S.V., S.B., A.L., K.G., C. Mohr, S. Steiner, P. Luedi, T.G. and P.P., unpublished work). On close inspection of these 95 S. cerevisiae syntenic loci, we found evidence for novel ORFs, and for substantial boundary changes of annotated ORFs. Figure 1 outlines the categories of changes suggested by this comparative genomics approach.
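The notion of relaxed synteny used here can be illustrated with a small sketch. In the study itself synteny was judged manually; the rule below (a gene counts as loosely syntenic if at least one nearby A. gossypii neighbor has its homolog close by on the same S. cerevisiae chromosome) is only an assumption made for illustration, as are the function name, the `flank` and `max_distance` parameters, and the gene identifiers. For simplicity each A. gossypii gene maps to a single S. cerevisiae locus, which ignores twin genes.

```python
def is_loosely_syntenic(gene, ag_order, sc_locus, flank=3, max_distance=10):
    """Return True if `gene` sits in a conserved neighborhood.

    ag_order : list of A. gossypii gene ids in chromosomal order
    sc_locus : dict mapping an A. gossypii gene id to the
               (S. cerevisiae chromosome, gene index) of its homolog
    flank, max_distance : hypothetical tolerance parameters
    """
    if gene not in sc_locus:
        return False
    chrom, pos = sc_locus[gene]
    i = ag_order.index(gene)
    # Up to `flank` neighbors on each side in the A. gossypii gene order.
    neighbours = ag_order[max(0, i - flank):i] + ag_order[i + 1:i + 1 + flank]
    for other in neighbours:
        hit = sc_locus.get(other)
        if hit and hit[0] == chrom and abs(hit[1] - pos) <= max_distance:
            return True
    return False


# Made-up example: ag1 and ag2 are neighbors in both genomes, ag3 is not.
ag_order = ["ag1", "ag2", "ag3", "ag4"]
sc_locus = {"ag1": ("IV", 100), "ag2": ("IV", 102),
            "ag3": ("XII", 5), "ag4": ("IV", 400)}
print(is_loosely_syntenic("ag2", ag_order, sc_locus))
print(is_loosely_syntenic("ag3", ag_order, sc_locus))
```

A tolerance window of this kind is one way to accommodate local rearrangements and gene loss while still requiring a shared genomic context.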
We first present data supporting novel protein-coding genes and provide detailed analysis of the different types of boundary changes of annotated ORFs due to novel exons, 5'- or 3'-end extension, or even fusion of adjacent ORFs. Second, we will focus on the validation of the approximately 2,000 hypothetical ORFs. We will present evidence that 50% of these hypothetical ORFs are real, and provide arguments to consider several hundred as probably spurious.
In 23 annotation-free regions, we discovered homology to syntenic small A. gossypii ORFs as outlined in Figure 1a and summarized in Table 1. These presumptive novel S. cerevisiae ORFs are 52 to 134 codons long. Twenty have a size below 100 codons, the arbitrary cut-off for small and nonhomologous yeast ORF annotation, several contain an intron, and one contains two introns. The short length and the presence of introns explain why these ORFs have so far remained undiscovered. An additional example of a novel S. cerevisiae ORF identification by comparison to the A. gossypii genome was recently published.
We carried out homology searches for all novel ORFs against the available fungal databases. This analysis revealed that all but one of the novel ORFs are present in hemiascomycetes and for 15 ORFs, homologs were found in at least two of the following databases: hemiascomycetes, Candida albicans, Schizosaccharomyces pombe and Neurospora crassa (Table 1). This suggests that they represent conserved fungal proteins. For two genes, YMR194C-B and YNL024C-A, we identified homologs in higher eukaryotes, including mouse and human. The conservation in other species, and particularly their syntenic positions in A. gossypii, strongly support the authenticity of these novel genes.
We screened the 23 protein sequences for the presence of known domains but did not find any significant hits. These novel S. cerevisiae ORFs were not deleted by the yeast gene deletion consortium but the A. gossypii homologs of YIL156W-B, YJL127C-B, YMR194C-B and YNL138W-A have been deleted (K.G. and T.G., unpublished data). One deletion is lethal; the others did not exhibit any apparent phenotype under normal growth conditions. Recently, two of the novel ORFs - YPL096C-A (ERI1, ER-associated Ras Inhibitor 1) and YKL138C-A (HSK3, Helper of Ask1) - were added to the Saccharomyces Genome Database (SGD) as reserved gene names, indicating unpublished functional data, and one novel ORF, YPR036W-A, was shown to be expressed in response to drug treatment.
A similar approach based on the so-called Génolevures project, a partial shotgun sequencing of 13 hemiascomycete genomes, suggested the presence of 50 overlooked ORFs in S. cerevisiae, distinct from the set of 23 described in this paper. These 50 ORFs were recently incorporated in the SGD. Arguing that the species under consideration were too closely related, Wood et al. recommended further investigations before considering these 50 novel ORFs as real. Having the A. gossypii genome to hand allowed us to evaluate the authenticity of these proposed ORFs. Indeed, we found 20 of the suggested novel ORFs at syntenic positions (see Additional data files). Similarly, the comparison between the S. pombe and the S. cerevisiae genome annotations identified three additional novel S. cerevisiae ORFs [18, 19], distinct from the 23 ORFs discussed above. All three correspond to syntenic homologs in A. gossypii, which confirms the assumption that they are real ORFs. More recently, 84 S. cerevisiae small ORFs, called smORFs, were identified on the basis of homology to a larger fungal database and experimental evidence for transcription products. Upon re-evaluation, we found that five smORFs correspond to novel ORFs described here and five others match sequences of ORF extensions as discussed below. Several smORFs correspond to RNA genes or match the opposite strand of previously annotated ORFs in both S. cerevisiae and A. gossypii and thus do not represent protein-coding genes. For the remaining smORFs, there were no homologs found in A. gossypii.
Novel introns and exons
Splicing rules and intron positions are generally conserved in A. gossypii and S. cerevisiae. On this basis we were able to identify 13 cases of probably overlooked introns in S. cerevisiae, as schematically represented in Figure 1b. Splicing of the novel introns and fusion of the novel exon extend the S. cerevisiae ORFs up to 236 codons and lead in most cases to substantially increased similarity between homologs of the two species (data not shown). The ORFs under consideration, the overall size increases, and other supporting evidence are shown in Table 2, which summarizes all 72 ORF extensions. Perfect splice consensus sequences were found for only three genes, which explains the difficulty in recognizing these introns. Finally, for one gene, SEF1 (YBL066C), we propose a base-pair change in addition to an intron. We tested the authenticity of the proposed introns for YKR004C (ECM9), YML017W (PSP2) and YOL048C using 5' rapid amplification of cDNA ends (5' RACE). In all three cases, the intron could be confirmed by sequencing the cDNA obtained (AY245791, AY245792, and AY245793). cDNA and genomic sequence alignment confirms that the intron of YKR004C is spliced at perfect consensus splice sequences and that both YML017W and YOL048C bear non-consensus splice sites, with the respective donor, branch-point and acceptor sequences GTATGT--CACTAAC--CAG and GTAAGT--GACTAAC--TAG. In three cases of novel introns - YBL091C-A, YHR079C-A and YOL048C - splicing has already been proposed by either Blandin et al. or Wood et al.
In addition to overlooked introns, we identified one case of a potentially wrongly assigned 5' splice site in CPT1 (YNL130C), which codes for the sn-1,2-diacylglycerol cholinephosphotransferase. The current annotation proposes that CPT1 would be spliced at a mismatched splice acceptor sequence. However, comparison with the A. gossypii homolog strongly suggests an intron of 92 base-pairs (bp), instead of 441 bp, with perfect consensus splice sequences. This would result in a protein of 407 amino acids with increased similarity to its A. gossypii homolog. This suggestion is supported by comparison with other fungal species, for example C. albicans and S. pombe (Table 2). Finally, a size of 407 amino acids for this enzyme was already proposed in the first publication describing it.
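To illustrate why non-consensus signals make introns easy to miss, the following sketch scans a sequence for candidate introns using relaxed motifs. It is not the procedure used in this study, where introns were inferred from homology to A. gossypii. The regular expressions are assumptions chosen so that they accept the yeast consensus (GTATGT, TACTAAC, YAG) as well as the two variants quoted above; the function name, length limits and test sequence are hypothetical.

```python
import re

# Relaxed motifs (illustrative assumptions, not a validated splice model).
DONOR = re.compile(r"GTA[AT]GT")       # 5' splice site
BRANCH = re.compile(r"[ACGT]ACTAAC")   # branch point
ACCEPTOR = re.compile(r"[ACGT]AG")     # 3' splice site


def find_intron_candidates(seq, min_len=30, max_len=1000):
    """Return sorted (start, end) half-open intervals that look like introns."""
    seq = seq.upper()
    candidates = set()
    for donor in DONOR.finditer(seq):
        # Every branch point downstream of this donor is considered.
        for branch in BRANCH.finditer(seq, donor.end()):
            # Use the first acceptor found after the branch point.
            acceptor = ACCEPTOR.search(seq, branch.end())
            if acceptor is None:
                continue
            length = acceptor.end() - donor.start()
            if min_len <= length <= max_len:
                candidates.add((donor.start(), acceptor.end()))
    return sorted(candidates)


# Toy sequence: two exons flanking an intron with the GTAAGT--GACTAAC--TAG signals.
toy = "ATGAAA" + "GTAAGT" + "T" * 20 + "GACTAAC" + "T" * 10 + "TAG" + "GGGCCC"
for start, end in find_intron_candidates(toy):
    print(start, end, toy[:start] + toy[end:])
```

A scan restricted to the strict consensus would miss both variants, which is consistent with the observation that only three of the novel introns carry perfect splice signals.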
A special case, not listed in Table 2, is the intron in STO1 (YMR125W), a gene that encodes the large subunit of the nuclear cap-binding protein complex, a transcriptional activator of glycolytic genes. The comparison with A. gossypii cannot distinguish between two alternatives: presence or absence of an intron, as shown in Figure 2. The S. cerevisiae sequence currently available at SGD is annotated with an intron. Although we noticed the presence of an equivalent intron in A. gossypii, homology is conserved between the two non-spliced forms of these genes in the two organisms as well. Therefore, it may be possible that the STO1 locus in both organisms encodes two proteins with differently charged amino ends.
5' and 3' ORF extensions
In 35 cases, it was possible to extend the boundaries of ORFs into annotation-free regions by artificially introducing single base-pair changes in the S. cerevisiae genomic sequence. These changes eliminated presumptive frameshifts or premature stop codons, as outlined in Figure 1c. The ORFs affected, the increase in ORF size (more than 70 amino acids for 50% of the ORFs), and other supporting evidence are listed as part of Table 2. In 15 cases, we resequenced the region of the proposed change and confirmed the sequencing error. Finally, homology searches also supported the proposed sequence corrections. All regions of suggested change were inspected using BLAST searches against the Génolevures, C. albicans, S. pombe and N. crassa sequence data. In more than 70% of cases, we found homologous sequences in two or more databases that matched the A. gossypii annotation (Table 2).
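The mechanical effect of such a correction can be shown with a toy example. The sketch below is only an illustration under stated assumptions: it tries every single-base deletion downstream of a start codon and reports the one that yields the longest ORF. In the study, candidate corrections were guided by homology to the A. gossypii ORF rather than by ORF length alone; the function names and the test sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_length_from(seq, start):
    """Count codons from `start` up to, but not including, the first in-frame stop."""
    codons = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return codons
        codons += 1
    return codons  # no stop codon found before the end of the sequence


def best_single_base_deletion(seq, start):
    """Return (position, codons) for the deletion that gives the longest ORF.

    Position is None if no deletion improves on the unedited sequence.
    Ties are resolved in favor of the first position tested.
    """
    best = (None, orf_length_from(seq, start))
    for pos in range(start + 3, len(seq)):
        edited = seq[:pos] + seq[pos + 1:]
        n = orf_length_from(edited, start)
        if n > best[1]:
            best = (pos, n)
    return best


# A 7-codon ORF (ATG AAA GCT GAT GGG AAA CCC TAA) with one spurious C inserted
# after the second codon, which creates a premature in-frame TGA.
faulty = "ATGAAA" + "C" + "GCTGATGGGAAACCCTAA"
print(orf_length_from(faulty, 0))
print(best_single_base_deletion(faulty, 0))
```

Several different deletions restore the same ORF length in this example, so length alone does not pinpoint the erroneous base. This is one reason why alignment to a homolog, and ultimately resequencing, is needed to settle each case.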
A special case of ORF extension concerns VPS5 (YOR069W), for which we propose an annotation rather than a sequence correction. The A. gossypii homolog is much longer and consideration of a further upstream start codon for VPS5 would result in a 5' extension 364 codons long with strongly enhanced homology to the A. gossypii homolog.
Another 22 proposed modifications resulted in the fusion of two previously distinct S. cerevisiae ORFs, as outlined in Figure 1d. A compilation of the A. gossypii ORFs, the fused S. cerevisiae ORFs, and their sizes is given in Table 2 and Table 3. As for the ORF extensions, we obtained supporting evidence for the validity of these fusions from database searches. For 17 of the proposed fusions, we found homologs of similar sizes in two or more fungal databases. Moreover, 10 of the ORF fusions had already been reported but not yet been included in databases, and seven are supported by a much better alignment to a duplicated copy in S. cerevisiae.
It should be pointed out that S. cerevisiae carries pseudogenes and that confirmed pseudogenes may have homology over their entire length to single A. gossypii ORFs; see examples in Table 3 for three pseudogenes annotated as YER039C/YER039C-A, YLL017W/YLL016W, and YOL163W/YOL162W. Consequently, discrepancies observed between ORFs of the two species may either result from sequencing errors or may represent real pseudogenes. Therefore, we experimentally investigated nine of the proposed ORF fusions by resequencing the respective genomic regions in S. cerevisiae strain S288C, the reference strain of the yeast genome sequencing project. In eight cases, a sequencing error was found, confirming the fusion of eight pairs of neighboring genes (Table 2 and Table 3). On the other hand, resequencing also revealed that YJL107W/YJL108W is a novel pseudogene that bears a single point mutation in S288C.
In addition, programmed ribosomal frameshifting has been demonstrated in S. cerevisiae [23, 24] and this might explain some of the observed differences between S. cerevisiae and A. gossypii genomic sequences. Therefore, resequencing all the questioned regions in S. cerevisiae would be needed to be able to discriminate between sequencing errors, pseudogenes and functional frameshifts.
Gene extensions revealing additional functional domains
We analyzed the presence of known functional domains in S. cerevisiae proteins with or without the proposed changes. Most of these extensions did not generate additional domains in the proteins. YKL033W-A, however, could be extended at the 3' end by 176 codons, adding a HAD (haloacid dehalogenase) domain (InterPro: PF00702) and suggesting that this protein of previously unknown function may have a role in the assimilation of halogenated compounds. Similarly, YMR269W was described as a hypothetical protein of 164 amino acids. We propose an amino-terminal extension of 69 amino acids, which would generate a protein of 211 amino acids with 11% greater similarity to its A. gossypii homolog (Figure 3a,b). This proposal was confirmed by resequencing. Domain analysis revealed the presence of a putative RNA-binding domain (D111: PS50174) in both the extended S. cerevisiae and the annotated A. gossypii proteins (Figure 3c). YMR269W might therefore have a role in RNA-mediated cellular processes such as splicing, transcription, or translation. Indeed, YMR269W was recently found to interact with the translation initiation factor GCN3 (YKR026C) in a whole-genome two-hybrid screen, which also points to a role in translation. Finally, evaluation of expression data from over 100 genome-scale experiments showed that YMR269W is regulated in a manner very similar to genes involved in protein synthesis. It thus appears very likely that the 'extended' version of YMR269W is involved in protein synthesis.
Confirmation of hypothetical ORFs as real ORFs
In 1996 the yeast genome sequencing consortium faced the difficulty of annotating the first eukaryotic genome. Many potential ORFs in the newly sequenced genome did not have homology with entries in the existing databases, and discrimination between 'real' and 'chance' ORFs was often not possible. Novel S. cerevisiae genes lacking homology to any database were annotated if they were at least 100 codons long. The use of this arbitrary cut-off permitted the annotation of most of the 'real' genes but also led to the annotation of many questionable ORFs. Since then, a substantial number of these ORFs have been functionally characterized. However, many are still annotated as hypothetical ORFs because of a lack of functional data. The identification of homologs of these ORFs in related organisms can be taken as strong evidence for their biological significance. Because A. gossypii shares as much as 95% of its genes with S. cerevisiae (90% being in synteny) it is an excellent organism to evaluate the authenticity of yeast hypothetical ORFs.
Currently, 1,885 S. cerevisiae ORFs are classified as hypothetical ORFs in the Saccharomyces Genome Database (SGD). We compared these genes to the A. gossypii genome annotation and identified a homolog in A. gossypii for 1,041 of them. Most important, 999 of these (96%) share both homology and synteny and can therefore be considered to be orthologs. The full list of hypothetical ORFs that should be regarded as real ORFs because of their homology and synteny with A. gossypii is available as an Additional data file. The Munich Information Center for Protein Sequences (MIPS), the other publicly accessible S. cerevisiae genome database, lists 988 of the S. cerevisiae ORFs classified as hypothetical or questionable (with questionable referring to hypothetical ORFs overlapping functionally characterized ORFs). We found homologs for 279 of these ORFs at syntenic positions in the genome of A. gossypii and all belong to the group of ORFs identified above as real among the 1,885 hypothetical ORFs at SGD. This comparison therefore provides strong evidence for the authenticity of a substantial part of the ORFs annotated as hypothetical in both SGD and MIPS.
Spurious ORFs among S. cerevisiae hypothetical ORFs
We assume it unlikely that all the remaining 844 hypothetical S. cerevisiae ORFs in SGD encode proteins, as only 10% of the known functional S. cerevisiae genes have no homolog in A. gossypii (F.S.D., S.V., S.B., A.L., K.G., C. Mohr, S. Steiner, P. Luedi, T.G. and P.P., unpublished work). They cannot, however, be directly investigated using this comparative approach owing to the absence of homologous genes in A. gossypii. Nevertheless, indirect evidence of the dubiousness of a subgroup of the ORFs absent in A. gossypii can be obtained by taking into consideration that they overlap other ORFs. The inspection of all overlapping pairs among annotated S. cerevisiae ORFs (based on our revised version of the S. cerevisiae ORF boundaries) reveals that, in the vast majority of the cases, one of the two ORFs belongs to the group of hypothetical ORFs lacking a syntenic homolog in A. gossypii. Furthermore, there is some experimental evidence that, in rare cases, functional fungal genes overlap at the 3' ends of their ORFs (in other words, the opposite strand of one ORF acts as transcription terminator for the other ORF and vice versa), but there is no experimental demonstration that a sequence within a functional ORF can act as promoter for another gene. We used these rules, in addition to the absence or presence of homologs in A. gossypii, to validate S. cerevisiae overlapping ORFs. We found three different categories among the 419 pairs of overlaps as schematically shown in Figure 4. For only seven pairs, A. gossypii carries homologs of both ORFs. Two pairs of homologs overlap in A. gossypii; the other five do not. Two cases of 5' end overlapping ORF pairs are probably explained by the assignment of the wrong start codon, and the remaining cases relate to overlapping 3' ends, which supports the hypothesis that ORF overlaps are rare in S. cerevisiae and that they involve 3' ends of ORFs (see Figure 4 legend).
For 367 pairs, a homolog of one ORF was found in A. gossypii but not of the other, and we propose that the latter ORFs are very likely to be spurious (see Figure 4, class B). These ORFs are listed in the additional data files with information about their present functional annotation, their sizes, and the type of overlap. For 66% of the pairs in this class, one or both presumptive promoter regions overlap an ORF sequence. In the remaining 34%, both terminator regions overlap ORF sequences. These latter cases should be viewed with caution as a very small percentage of them might turn out to be real. Indeed, in two cases marked in the additional data files, published data confirm the authenticity of a suggested spurious ORF. It should be noted that some of the proposed spurious ORFs are reported in very close relatives of S. cerevisiae such as S. bayanus. However, the similarity is often restricted to the overlapping regions and is likely to result from the transfer of conservation from the real coding region to the other frames or strand. As A. gossypii is more distantly related, such homology can be found, though seldom, but does not match ORFs owing to the presence of STOP codons. Finally, in the remaining 45 pairs, for two ORFs we could find no homolog in A. gossypii and the criteria applied above cannot be used here. However, 36 ORFs could be considered as likely to be spurious as they overlap ORFs with described function or with a size of at least 500 codons (Figure 4, class D, and see Additional data files).
In summary, comparison of pairs of overlapping S. cerevisiae ORFs with the A. gossypii genome suggests that probably 403 of the remaining 844 hypothetical ORFs should be considered with care as they are likely to be spurious. Wood et al. used ORF overlaps as one criterion to disregard 371 S. cerevisiae genes annotated as hypothetical, 289 of which are spurious according to the criteria applied above. Our analysis leaves about 450 hypothetical ORFs for which no information is currently available to categorize them as likely to be real or spurious. Additional evidence from similar analyses with other yeast species will be necessary to resolve these problems. Therefore, projects such as the current Saccharomyces Genome Project will hopefully allow for a final S. cerevisiae genome annotation, seven years after completion of the genome sequence.
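The decision rule applied to overlapping pairs can be summarized in a few lines. The sketch below is a simplified illustration: it uses descriptive labels rather than the class letters of Figure 4, it ignores the additional criteria (promoter versus terminator overlap, described function, size of at least 500 codons), and the `Orf` type, field names and example ORFs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_ag_homolog: bool  # homolog detected in A. gossypii


def overlaps(a, b):
    """True if the two coordinate ranges share at least one base."""
    return a.start <= b.end and b.start <= a.end


def classify_pair(a, b):
    """Return (category, names of ORFs flagged as likely spurious)."""
    if not overlaps(a, b):
        raise ValueError("ORFs do not overlap")
    if a.has_ag_homolog and b.has_ag_homolog:
        return "both_conserved", []
    if a.has_ag_homolog != b.has_ag_homolog:
        unsupported = b if a.has_ag_homolog else a
        return "one_conserved", [unsupported.name]
    # Neither ORF has a homolog: homology alone cannot decide.
    return "none_conserved", []


# Made-up pair: a conserved gene overlapped by an unsupported hypothetical ORF.
gene = Orf("ORF_A", 1000, 2500, True)
shadow = Orf("ORF_B", 2300, 2700, False)
print(classify_pair(gene, shadow))
```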
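The counts quoted in this and the preceding sections fit together as follows. The short script below only restates numbers given in the text; the variable names are ours.

```python
# Counts quoted in the text.
HYPOTHETICAL_SGD = 1885       # ORFs classified as hypothetical in SGD
WITH_AG_HOMOLOG = 1041        # of these, with a homolog in A. gossypii
SYNTENIC = 999                # of these, also at syntenic positions
OVERLAP_PAIRS = {
    "both_have_homolog": 7,
    "one_has_homolog": 367,
    "neither_has_homolog": 45,
}
SPURIOUS_FROM_NEITHER = 36    # flagged by function or size criteria

without_homolog = HYPOTHETICAL_SGD - WITH_AG_HOMOLOG
syntenic_share = SYNTENIC / WITH_AG_HOMOLOG
total_pairs = sum(OVERLAP_PAIRS.values())
likely_spurious = OVERLAP_PAIRS["one_has_homolog"] + SPURIOUS_FROM_NEITHER
unresolved = without_homolog - likely_spurious

print(f"hypothetical ORFs without homolog: {without_homolog}")
print(f"syntenic share of ORFs with homolog: {syntenic_share:.0%}")
print(f"overlapping pairs examined: {total_pairs}")
print(f"likely spurious: {likely_spurious}")
print(f"unresolved: {unresolved}")
```

The unresolved remainder of 441 ORFs corresponds to the "about 450" mentioned above.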
We have demonstrated the power of comparative genomics for the annotation of two completely sequenced fungal genomes. Whole-genome comparison guided the identification of novel ORFs, improved gene annotations, revealed sequencing errors, and helped to distinguish between real and spurious ORFs, thereby enhancing our view of the S. cerevisiae genome. As a consequence, these results will also contribute to the validity of genome-scale experiments, such as gene-expression profiling, an area where accurate gene annotation is crucial. The forthcoming availability of more yeast species genomes will, in an analogous way, drastically improve speed and accuracy of genome annotation of S. cerevisiae and the newly sequenced species. The method described here is straightforward for lower eukaryotes and should be applicable to any two closely related genomes of any complexity.
Materials and methods
S. cerevisiae AB972 (S288C) strain was used for resequencing and 5' RACE-based intron verification.
5' RACE-based verification of proposed introns in S. cerevisiae
Reverse transcription and 5' RACE were carried out using the SMART RACE cDNA amplification system (BD Bioscience Clontech). Gene-specific primer (GSP) sequences were selected approximately 200 bp downstream of the putative introns for YKR004C, YML017W, and YOL048C. The amplified cDNA fragments were purified from the gel, cloned in the TOPO-TA cloning vector (Invitrogen) and sequenced on an ABI Prism 310 sequencer (Applied Biosystems). The sequences are available in GenBank (accession numbers AY245791, AY245792, and AY245793).
Resequencing of S. cerevisiae genomic regions
The following 25 S. cerevisiae genomic regions were resequenced: YAL013W (AY260888), YAR044W/YAR042W (AY260892), YBL104C (AY260889), YBR074W/YBR075W (AY260891), YBR157W (AY260879), YCL008C (AY260880), YJL012C/YJL012C-A (AY227894), YJL016W/YJL017W (AY260898), YJL019W/YJL018W, YJL020C/YJL021C, YJL108C/YJL107C (AY227895), YJL159W (AY260881), YJL160C (AY260893), YJL178C (AY260894), YJR013W (AY260895), YKL033W-A (AY260896), YKL199C/YKL198C, YKL207W (AY260882), YKR056W (AY260897), YKR100C (AY260883), YLR205C (AY260884), YLR389C (AY260885), YMR269W (AY22789), YOR298C-A (AY260886), and YPR089W/YPR090W (AY260887). Primers flanking the region of the putative frameshift or premature stop mutation were selected to be unique within the yeast genome, and to have similar melting points. PCR was carried out using standard protocols on the AB972 (S288C) genomic DNA. PCR products were confirmed by agarose gel electrophoresis and sequencing was carried out on an ABI 310 sequencer using BigDye chemistry and protocols from ABI. Sequence assembly was performed using the phred/phrap/consed analysis package [29–31]. The sequences for YJL019W/YJL018W, YJL020C/YJL021C and YKL199C/YKL198C were recently corrected in SGD and were, therefore, not submitted to GenBank.
Sequence databases and sequence analysis
S. cerevisiae, S. pombe and hemiascomycete genomic sequence information was retrieved from GenBank at the National Center for Biotechnology Information (NCBI) [32, 33]. The S. cerevisiae genome was used as available at NCBI on 27 October, 2002. This release was submitted by the SGD [34, 35]. Sequence data from C. albicans and N. crassa were obtained from the Stanford Genome Technology Center and from the Neurospora Sequencing Project, Whitehead Institute/MIT Center for Genome Research, respectively. Sequence analysis was carried out using the GCG Wisconsin Package (Accelrys), BLAST and FASTA tools [39, 40]. Domain analysis was carried out using the InterProScan.pl algorithm from the European Bioinformatics Institute [41, 42]. Functional classifications of S. cerevisiae ORFs were taken from the SGD and MIPS.
Annotation of A. gossypii chromosomes
The 9 Mb genome was sequenced by combining three strategies: end-sequencing of chromosome-sorted plasmid and BAC clones, shotgun sequencing of sheared genomic DNA fragments, and extensive gap filling by primer walking, which resulted in an average accuracy of 99.8% (F.S.D., S.V., S.B., A.L., K.G., C. Mohr, S. Steiner, P. Luedi, T.G. and P.P., unpublished work). In the first round of annotation, all ORFs longer than 50 codons were searched using BLAST against the set of S. cerevisiae ORF translations available from SGD. A. gossypii ORFs with a hit at an E-value lower than or equal to 1e-2 were automatically annotated as S. cerevisiae homologs. This first draft of the A. gossypii genome annotation together with the BLAST results were re-evaluated case by case in a non-automatic procedure. A. gossypii ORFs sharing low or high homology with syntenic S. cerevisiae ORFs were kept. The synteny was independently assessed by two people. Inter-ORF regions were then compared with the six translation frames of the S. cerevisiae genome sequence, leading to the identification of potentially overlooked ORFs in S. cerevisiae and A. gossypii. In a final step, the remaining A. gossypii potential ORFs were searched against other databases and led to the annotation of A. gossypii genes with no homolog in S. cerevisiae.
Homology screening of all S. cerevisiae inter-ORF regions
A. gossypii ORF translations were searched against a locally built yeast inter-ORF sequence database using BLAST 2.0. A cut-off threshold E-value of 1e-2 was used to filter the results. RNA genes were automatically filtered out and the remaining hits were manually checked for synteny. Regions of discrepancy were carefully checked in A. gossypii and in all cases matched good-quality consensus sequence. The current S. cerevisiae genome annotation release was re-annotated with proposed changes or novel sequence using the Artemis annotation tool and the Sequin submission tool [32, 33].
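The automatic first-round assignment can be sketched as a filter over BLAST hits. The example below assumes that the search results are available in the standard 12-column tab-separated BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score); the output format actually used is not stated in the text. The function name and all identifiers in the sample data are hypothetical.

```python
import csv
import io

E_CUTOFF = 1e-2  # threshold quoted in the text


def best_hits(tabular_text, e_cutoff=E_CUTOFF):
    """Map each query to its best (subject, E-value) hit at or below the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > e_cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Made-up hits: the first query has two hits below the cutoff, the second has none.
sample = (
    "AgORF1\tSc_orfA\t55.0\t200\t90\t0\t1\t200\t1\t200\t1e-50\t180\n"
    "AgORF1\tSc_orfB\t30.0\t80\t56\t0\t1\t80\t1\t80\t0.005\t40\n"
    "AgORF2\tSc_orfC\t25.0\t60\t45\t0\t1\t60\t1\t60\t0.5\t25\n"
)
print(best_hits(sample))
```

A permissive cutoff such as 1e-2 keeps weak hits in play, which makes sense only because each assignment was subsequently re-evaluated case by case and checked for synteny.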
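Building an inter-ORF database amounts to taking the complement of all annotated features on each chromosome. A minimal sketch follows, assuming 1-based inclusive coordinates and treating features on both strands alike; the function name, the `min_len` parameter and the example coordinates are hypothetical, and the way the database was actually built is not described in the text.

```python
def annotation_free_regions(chrom_length, features, min_len=1):
    """Return (start, end) intervals not covered by any annotated feature.

    features : iterable of (start, end) pairs, 1-based and inclusive,
               covering ORFs, RNA genes, transposable elements, etc.
    """
    gaps = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(features):
        if start > cursor and start - cursor >= min_len:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= chrom_length and chrom_length - cursor + 1 >= min_len:
        gaps.append((cursor, chrom_length))
    return gaps


# Toy chromosome of 1,000 bp with two overlapping features and one isolated feature.
print(annotation_free_regions(1000, [(600, 700), (100, 300), (250, 400)]))
```

Each resulting interval would then be extracted as a sequence and used as a search target for the A. gossypii ORF translations.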
Experimentally verified sequence corrections are available in GenBank under the accession numbers: AY260888, AY260892, AY260889, AY260891, AY260879, AY260880, AY227894, AY260898, AY227895, AY260881, AY260893, AY260894, AY260895, AY260896, AY260882, AY260897, AY260883, AY260884, AY260885, AY22789, AY260886, AY260887, AY245791, AY245792 and AY245793.
Additional data files
The following files are available: a list of the novel S. cerevisiae ORFs proposed by Blandin et al. and Wood et al. [17, 18] for which a syntenic homolog was found in the A. gossypii genome (Additional data file 1); a list of all S. cerevisiae hypothetical ORFs for which a homolog was found in the A. gossypii genome (Additional data file 2); a list of all S. cerevisiae hypothetical ORFs for which no homolog was found in the A. gossypii genome (Additional data file 3). A list of all class C overlaps together with gene sizes, gene functions and overlap directions can be found in Additional data file 4; these genes are suggested to be spurious based on our criteria. A list of class D overlaps, in which neither of the two overlapping genes in S. cerevisiae has a homolog in A. gossypii, can be found in Additional data file 5; the size and function classifications were used to predict spurious genes in some of these cases. A graphical display of the proposed S. cerevisiae annotation changes based on comparison with A. gossypii can be found in Additional data file 6. GenBank files for each of the S. cerevisiae chromosomes that take account of the proposed modifications (prior to confirmation of the sequence) can be found in Additional data file 7. GenBank files for each A. gossypii genomic locus used to infer annotation corrections in S. cerevisiae can be found in Additional data file 8.
Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, et al: Life with 6000 genes. Science. 1996, 274: 563-567. 10.1126/science.274.5287.546.
The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998, 282: 2012-2018. 10.1126/science.282.5396.2012.
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al: The genome sequence of Drosophila melanogaster. Science. 2000, 287: 2185-2195. 10.1126/science.287.5461.2185.
The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JM, Wides R, et al: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298: 129-149. 10.1126/science.1076181.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
Mewes HW, Albermann K, Bahr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, et al: Overview of the yeast genome. Nature. 1997, 387: 7-65. 10.1038/42755.
Sequence updates at SGD. [http://genome-www.stanford.edu/Saccharomyces/sequenceupdates.shtml]
Chromosome III resequencing information at MIPS. [http://mips.gsf.de/cgi-bin/proj/yeast/THREE]
Philippsen P, Kleine K, Pohlmann R, Dusterhoft A, Hamberg K, Hegemann JH, Obermaier B, Urrestarazu LA, Aert R, Albermann K, et al: The nucleotide sequence of Saccharomyces cerevisiae chromosome XIV and its evolutionary implications. Nature. 1997, 387: 93-98.
Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 708-713. 10.1038/42711.
Zhang Z, Dietrich FS: Verification of a new gene on Saccharomyces cerevisiae Chromosome III. Yeast. 2003, 20: 731-738. 10.1002/yea.996.
Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, et al: Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002, 418: 387-391. 10.1038/nature00935.
Miura F, Yada T, Nakai K, Sakaki Y, Ito T: Differential display analysis of mutants for the transcription factor Pdr1p regulating multidrug resistance in the budding yeast. FEBS Lett. 2001, 505: 103-108. 10.1016/S0014-5793(01)02792-2.
Blandin G, Durrens P, Tekaia F, Aigle M, Bolotin-Fukuhara M, Bon E, Casaregola S, de Montigny J, Gaillardin C, Lepingle A, et al: Genomic exploration of the hemiascomycetous yeasts: 4. The genome of Saccharomyces cerevisiae revisited. FEBS Lett. 2000, 487: 31-36. 10.1016/S0014-5793(00)02275-4.
Wood V, Rutherford KM, Ivens A, Rajandream MA, Barrell B: A reannotation of the Saccharomyces cerevisiae genome. Comp Funct Genomics. 2001, 2: 143-154. 10.1002/cfg.86.
Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, Sgouros J, Peat N, Hayles J, Baker S, et al: The genome sequence of Schizosaccharomyces pombe. Nature. 2002, 415: 871-880. 10.1038/nature724.
Kessler MM, Zeng Q, Hogan S, Cook R, Morales AJ, Cottarel G: Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res. 2003, 13: 264-271. 10.1101/gr.232903.
Hjelmstad RH, Bell RM: The sn-1,2-diacylglycerol cholinephosphotransferase of Saccharomyces cerevisiae. Nucleotide sequence, transcriptional mapping, and gene product analysis of the CPT1 gene. J Biol Chem. 1990, 265: 1755-1764.
Harrison P, Kumar A, Lan N, Echols N, Snyder M, Gerstein M: A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution. J Mol Biol. 2002, 316: 409-419. 10.1006/jmbi.2001.5343.
Morris DK, Lundblad V: Programmed translational frameshifting in a gene required for yeast telomere replication. Curr Biol. 1997, 7: 969-976.
Asakura T, Sasaki T, Nagano F, Satoh A, Obaishi H, Nishioka H, Imamura H, Hotta K, Tanaka K, Nakanishi H, et al: Isolation and characterization of a novel actin filament-binding protein from Saccharomyces cerevisiae. Oncogene. 1998, 16: 121-130. 10.1038/sj.onc.1201487.
Uetz P, Giot L, Cagney G, Mansfield T, Judson RS, Knight JR, Lockshon D, Narayan VA, Srinivasan M, Pochart P, et al: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627. 10.1038/35001009.
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109-126.
Gerads M, Ernst JF: Overlapping coding regions and transcriptional units of two essential chromosomal genes (CCT8, TRP1) in the fungal pathogen Candida albicans. Nucleic Acids Res. 1998, 26: 5061-5066. 10.1093/nar/26.22.5061.
Saccharomyces Genome Sequencing. [http://genome.wustl.edu/projects/yeast/]
Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8: 195-202.
Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.
Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.
National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. 2003, 31: 23-27. 10.1093/nar/gkg057.
Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26: 73-79. 10.1093/nar/26.1.73.
Saccharomyces Genome Database. [http://www.yeastgenome.org]
Stanford Genome Technology Center, Candida albicans sequence. [http://www-sequence.stanford.edu/group/candida/download.html]
Whitehead Institute/MIT Center for Genome Research. [http://www-genome.wi.mit.edu]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.
Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
Pearson WR: Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol. 1994, 25: 365-389. 10.1385/0-89603-276-0:365.
Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001, 29: 37-40. 10.1093/nar/29.1.37.
Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17: 847-848. 10.1093/bioinformatics/17.9.847.
Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2002, 30: 31-34. 10.1093/nar/30.1.31.
Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16: 944-945. 10.1093/bioinformatics/16.10.944.
Li Y, Kane T, Tipper C, Spatrick P, Jenness DD: Yeast mutants affecting possible quality control of plasma membrane proteins. Mol Cell Biol. 1999, 19: 3588-3599.
Robben J, Hertveldt K, Volckaert G: Revisiting the yeast chromosome VI DNA sequence reveals a correction merging YFL007w and YFL006w to a single ORF. Yeast. 2002, 19: 699-702. 10.1002/yea.870.
Belenkiy R, Haefele A, Eisen MB, Wohlrab H: The yeast mitochondrial transport proteins: new sequences and consensus residues, lack of direct relation between consensus residues and transmembrane helices, expression patterns of the transport protein genes, and protein-protein interactions with other proteins. Biochim Biophys Acta. 2000, 1467: 207-218. 10.1016/S0005-2736(00)00222-4.
Horowitz DS, Abelson J: A U5 small nuclear ribonucleoprotein particle protein involved only in the second step of pre-mRNA splicing in Saccharomyces cerevisiae. Mol Cell Biol. 1993, 13: 2959-2970.
Cooper KF, Mallory MJ, Egeland DB, Jarnik M, Strich R: Ama1p is a meiosis-specific regulator of the anaphase promoting complex/cyclosome in yeast. Proc Natl Acad Sci USA. 2000, 97: 14548-14553. 10.1073/pnas.250351297.
Cheng C, Mu J, Farkas I, Huang D, Goebl MG, Roach PJ: Requirement of the self-glucosylating initiator proteins Glg1p and Glg2p for glycogen accumulation in Saccharomyces cerevisiae. Mol Cell Biol. 1995, 15: 6632-6640.
Acknowledgements
We thank Philippe Luedi and Amy Gladfelter for supporting discussions, and Arndt Brachat and Mike Primig for critical reading of the manuscript. This work was supported by grants from the University of Basel and Duke University.