A Drosophilafull-length cDNA resource

Stapleton, Mark; Carlson, Joe; Brokstein, Peter; Yu, Charles; Champe, Mark; George, Reed; Guarin, Hannibal; Kronmiller, Brent; Pacleb, Joanne; Park, Soo; Wan, Ken; Rubin, Gerald M; Celniker, Susan E

doi:10.1186/gb-2002-3-12-research0080

Research
Published: 23 December 2002

A Drosophilafull-length cDNA resource

Mark Stapleton^1,2,
Joe Carlson^1,2,
Peter Brokstein^1,2^nAff5,
Charles Yu^1,2,
Mark Champe^1,2,6,
Reed George^1,2,
Hannibal Guarin^1,2,
Brent Kronmiller^1,2,7,
Joanne Pacleb^1,2,
Soo Park^1,2,
Ken Wan^1,2,
Gerald M Rubin^1,3,4 &
…
Susan E Celniker^1,2

Genome Biology volume 3, Article number: research0080.1 (2002) Cite this article

28k Accesses
96 Citations
1 Altmetric
Metrics details

Abstract

Background

A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.

Results

We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing.

Conclusions

We show that comparison of cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection and ensure its utility for experimentation. Clones were eliminated either because they carry single nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence.

Background

One of the goals of the Berkeley Drosophila Genome Project is to define experimentally the transcribed portions of the genome by producing a collection of fully sequenced cDNAs. We have previously reported the construction of cDNA libraries from a variety of tissues and developmental stages; these libraries were used to generate over 250,000 expressed sequence tags (ESTs), corresponding to approximately 70% of the predicted protein-coding genes in the Drosophila melanogaster genome [1,2]. We used computational analysis of these ESTs to establish a collection of putative full-length cDNA clones, the Drosophila Gene Collection (DGC) [1,2]. Here, we describe the process by which we sequenced the full inserts of 8,921 cDNA clones from the DGC, describe the methods by which we assess each clone's likelihood of containing a complete and accurate protein-coding region, and illustrate how these data can be used to uncover additional cases of RNA editing. We have confirmed the identification of 5,375 cDNA clones that can be used with confidence for protein expression or genetic complementation.

Results and discussion

Sequencing strategy

Current approaches to full-insert sequencing of cDNA clones include concatenated cDNA sequencing [3], primer walking [4], and strategies using transposon insertion to create priming sites [5,6,7,8,9]. We adopted a cDNA sequencing strategy that relies on an in vitro transposon insertion system based on the MuA transposase, combined with primer walking (see Materials and methods for details).

The production of full-insert sequences from DGC cDNAs is summarized in Tables 1 and 2. For DGCr1, clones were sized before sequencing. Small clones (< 1.4 kilobases (kb)) were sequenced with custom primers and larger clones were sequenced using either mapped or unmapped transposon insertions. For DGCr2, clones were not sized and a set of unmapped transposon insertions was sequenced to generate an average of 5× sequence coverage. For both DGCr1 and r2, custom oligonucleotide primers designed using Autofinish [10] were used to bring the sequences to high quality. To date, we have completed sequencing 93% of the DGCr1 clone set and 80% of the DGCr2 clone set. The strategy used for sequencing DGCr1 clones appears to be more efficient, because on average they required fewer sequencing reads than DGCr2 clones. However, we were able to reduce cycle time and increase throughput using the shotgun strategy adopted for sequencing the DGCr2 clones. The average insert size of the 8,770 high-quality cDNA sequences that have been submitted to GenBank is 2 kb and they total 17.5 megabases (Mb) of sequence. The largest clone (SD01389) is 8.7 kb and is derived from a gene (CG10011) that encodes a 2,119-amino-acid ankyrin repeat-containing protein.

Table 1 Status of DGCr1 and DGCr2 clones

Full size table

Table 2 Finished clone statistics

Full size table

Evaluating the coding potential of each cDNA on the basis of its full-insert sequence

For many potential uses in proteomics and functional genomics [11,12,13], it is important to establish cDNA collections comprised only of cDNAs with complete and uncorrupted open reading frames (ORFs). To determine which of our sequenced clones meet this standard, we compared them to the annotated Release 3 genome sequence [14,15] using a combination of BLAST [16] and Sim4 [17] alignments (see Materials and methods for details).

We grouped the cDNAs into four categories (Table 3). The first category contains a total of 5,916 cDNA clones, or 68% of the sequenced clones. We are confident that 5,375 of these clones contain a complete and accurate ORF, as they precisely match the Release 3 predicted protein for the corresponding gene. An additional 541 clones are from the SD, GM and AT libraries, which were generated from fly strains that are not isogenic with the strain used to produce the genome sequence. The predicted ORFs from clones from these libraries were required to be identical in length to the Release 3 predicted protein with less than 2% amino-acid difference to be placed in this category. We cannot at present distinguish whether these differences result from strain polymorphisms or reverse transcriptase (RT) errors. However, our own internal estimates of RT errors (see below), based on the observed nucleotide substitution rate in cDNAs derived from the same strain as the genomic sequence, and published estimates of strain polymorphisms [18] lead us to believe that the majority of these changes are the result of strain polymorphism.

Table 3 cDNA analysis

Full size table

The second category represents 2,450 clones that are known to be compromised in one of a number of ways. The sequences of the largest class of compromised clones (1,314) align to the Release 3 predicted transcripts, but have nucleotide discrepancies that are most likely the result of errors generated by RT during library construction. These include missense and frameshift (+/-1 or +/-2 nucleotide difference) changes in the predicted ORF relative to the Release 3 predicted protein. Clones placed in this class can show up to 2% amino acid differences from the Release 3 peptide for isogenic libraries, and up to 4% difference for non-isogenic libraries. We estimated the error rate of an RNAseH-deficient RT (SuperScriptII, Invitrogen, Carlsbad, CA) by comparing the nucleotide sequence of cDNAs from isogenic libraries to the genomic sequence. For the GH, HL, LD, and LP libraries [1], we observed an error rate of 1 in 4,000; for the RE and RH libraries [2], we observed an error rate of 1 in 1,000. This difference is likely due to the different RT reaction conditions used in these two library construction protocols [1,2]. Although these numbers are higher than the 1 in 15,000 figure reported for SuperScriptII (Taurai Nenguke, personal communication), the in vitro assay used to obtain this error rate is based on assaying a single site for mutations that revert an amber codon.

The next largest class of compromised clones (768) consists of clones apparently truncated at their 5' ends, as judged by comparison to the Release 3 predicted ORFs of the corresponding genes. The 768 5'-short clones represent 757 distinct Release 3 annotated transcripts. For 151 of the 5'-short clones, 143 from DGCr1 and eight from DGCr2, we were able to identify clones with longer ORFs by additional EST sequencing. The remaining 606 clones are assumed to be 5' short because they do not possess a 5' in-frame stop codon and the corresponding annotated ORF in Release 3 extends further 5'. This class of clones represents approximately 9% of all finished clones, consistent with our original estimates that 80-94% of the DGC clones would contain the full ORF [1,2].

The remaining six classes of compromised clones consist of a total of 368 cDNAs (4% of all finished clones, see Table 3). Eighty-three clones encode ORFs that are truncated at their carboxy-termini and are most likely the result of priming from internal poly(A) tracts. Seventy-seven clones contain two unrelated ORFs and are almost certainly the result of two cDNAs being cloned into the same plasmid vector during library construction. Seventy clones contain ORFs of less than 50 amino acids. One hundred and eleven clones overlap a Release 3 predicted gene but are transcribed from the opposite strand from that of the mRNA encoding the Release 3 predicted protein and are considered anti-sense transcripts; a number of such cases were documented in the reannotation of the genome [15] and have been reported in many organisms [19]. Twenty-one clones correspond to transcripts of transposable elements on the basis of their sequence similarity to identified Drosophila transposons [20]. Finally, six clones contain a bacterial transposable element (Tn10, IS1 or IS2) that most likely inserted into the clone during propagation in Escherichia coli (bacterial contaminants).

The third and fourth categories consist of clones that may represent alternative transcripts (138) and clones that are currently computationally unclassified (417), respectively. The summary of the analysis of these clones is described in Table 3.

Improving the DrosophilacDNA resource

We have identified and sequenced cDNA clones that contain a complete and accurate ORF for 40% of all predicted Drosophila genes. We plan on extending this project in two ways. First, we intend to increase the number of genes represented in this set of fully vetted cDNA clones using a combination of experimental approaches. We can use site-directed mutagenesis to correct clones that carry single nucleotide changes or other small, localized defects. For the majority of the compromised clones, we have candidate replacement clones available that were identified as part of our EST sequencing and analysis efforts [2]. Generation of the Release 3 annotation of the genome made extensive use of our full-insert sequence data [15]. In the course of that effort, human curators identified a total of 2,013 clones that have become the DGCr3. The DGCr3 currently includes 309 clones chosen to replace clones with truncated ORFs, 543 clones for genes that are not currently represented in the DGC, and 833 clones that represent alternative splicing forms. To identify cDNAs for the remaining genes, we plan on using a combination of additional EST sequencing, reverse transcriptase PCR (RT-PCR) and cDNA library screening. Second, we plan on transferring ORFs to a universal cloning system (see [21,22] for examples) in order to generate a standard reagent for proteomics and other functional genomic experiments. In collaboration with Orbigen [23], we have already generated 72 baculovirus expression clones from a set of Gateway (Invitrogen, Carlsbad, CA) clones encoding transcription factors.

mRNA editing

RNA editing is a well-documented mechanism of generating nucleotide diversity beyond that directly encoded by the genome. Adenosine deaminase (ADAR) targets double-stranded regions of nuclear-encoded RNAs, catalyzing the deamination of adenosine (A) to inosine (I) [24]. Inosine mimics guanosine (G) in its base-pairing properties, and the translational machinery of the cell interprets I as G. In this way, an A-to-I conversion in the mRNA can alter the genetic information and, consequently, protein function. Null mutations in the single ADAR gene in Drosophila (dADAR) suggest that the function of pre-mRNA editing is to modify adult behavior by altering signaling components in the nervous system [25,26]. Among the mRNAs known to be edited in Drosophila are those encoded by cacophony (a calcium channel gene) [27], paralytic (a sodium channel gene) [28] and GluCla (a chloride channel gene) [29], all of which have multiple editing sites in their coding sequences.

In the course of evaluating the quality of the DGCr1 and DGCr2 cDNAs, described above, we compared their translation products to those of the recently completed Release 3 genomic sequence. Such comparisons should reveal cases of RNA editing. In cases in which the predicted protein sequences disagreed, we examined the corresponding nucleotide sequences in search of site-specific A-to-G variation between cDNA and genomic sequences. We identified over 30 candidates consistent with RNA editing; however, additional cDNA or EST data will be required to distinguish RNA editing from RT errors or strain polymorphisms. In a few cases we had enough cDNA and EST data to indicate that RNA editing is the most likely explanation for the observed variation. One such example is shown in Figure 1. The gene CG18314 encodes a G-protein-coupled receptor of the rhodopsin family, containing a seven-transmembrane protein domain with similarity to β₂-adrenergic receptors of mouse and human [30,31]. Ten potential sites of RNA editing were revealed by comparison of the genomic sequence with those of two cDNAs and three ESTs. We validated these 10 sites by gene-specific RT-PCR using RNA isolated from heads of isogenic animals and identified 15 new sites (see legend to Figure 1). We are now in the process of a more rigorous and thorough analysis of potential RNA-editing targets.

Materials and methods

Sequencing strategy

The Drosophila Gene Collection (DGC) consists of two releases, DGCr1 and DGCr2. A process flow diagram of our sequencing strategies is available online [32] and is summarized below. The clones in DGCr1 were arrayed by insert size [1] and sequenced accordingly; clones in DGCr2 were not arrayed by size. DGCr1 clones less than 1.4 kb were assembled using phrap [33] and analyzed with custom scripts to determine whether they were complete. Autofinish (part of the consed computer software package) was used to automatically design custom primers [10] for clones that needed quality improvement. Clones that did not finish in the first two rounds of Autofinish were sent to a manual finishing queue for more sophisticated finishing. cDNA clones larger than 1.4 kb were divided into three groups: 1.4 to 3 kb, 3 to 4.5 kb, and greater than 4.5 kb. All clones were sequenced using the in vitro Template Generation System (TGS^tm Finnzyme). Clones 3 to 4.5 kb in size, were sequenced using a minimal path of transposon-bearing clones. Clones, 1.4 to 3 kb and those greater than 4.5 kb, were sequenced with 24 and 48 unmapped transposon-bearing clones, respectively. After the initial cycle of transposon sequencing, the clones were analyzed using in-house scripts and Autofinish to determine their state of completeness and quality. DGCr2 clones were sequenced using 24 unmapped transposon-bearing clones. After an initial cycle of transposon sequencing, the clones were analyzed for completeness and quality as described above for DGCr1 clones, using in-house scripts and Autofinish. DGCr2 clone sequences were screened for transposable element sequences, cases of co-ligation, and presence of a poly(A) tail before any finishing work was ordered.

In vitrotransposition and mapping insertion sites

Transposon insertion reactions were carried out in 96-well format using the Template Generation System (TGS^tm) according to the manufacturer's recommendations (Finnzyme). Transposon reactions consisted of 1 μl (50-150 ng) plasmid DNA isolated from Qiagen or Revprep DNA isolation robots, 1.6 μl 5× reaction buffer, 8 ng Entranceposon (Kan^R), 0.4 μl MuA transposase, and deionized water to bring the final volume to 8 μl. Reactions were carried out in PCR plates and incubated in an ABI thermocycler according to the manufacturer's instructions. After heat inactivation of the MuA transposase, 2 μl of the reaction were used to transform 17 μl of DH5α chemically competent cells (Invitrogen) in 96-well format. Following incubation at 37°C for 1 h in 183 μl SOC medium, cells were plated onto appropriate medium selecting for vector and Entranceposon antibiotic resistance. Plates were incubated at 37°C overnight. Colonies were picked into 1.2-ml polypropylene titer tubes (E&K Scientific) containing 0.5 ml LB medium supplemented with 7.5% glycerol and the appropriate antibiotics and incubated at 37°C overnight. These stocks were then used to inoculate 1.2 ml 2XYT medium in 96-well square deep-well plates (E&K Scientific) for culture and DNA plasmid preps. Transposon insertion sites were mapped relative to the vector ends by PCR essentially as described [34]. Forty-eight transposon-bearing clones were picked for PCR mapping using the Mu-End primer (present at both ends of the tranposon) in combination with vector-specific primers, resulting in 96 PCR products. Agarose gels were imaged using custom software developed in-house (Earl Cornell, LBNL) and analyzed using an algorithm, Supertramp [35,36], to identify a minimal path of transposon-bearing clones to be re-arrayed and sequenced.

DNA sequencing

Purified plasmid DNA from transposon-bearing clones was sequenced using 2 μl ABI BigDye II Dye terminator mix (Applied Biosystems) in a 10-μl reaction. Sequencing reactions were processed through 96-well Sephadex G-50 SF plates (Multiscreen filter plates; Millipore) and loaded onto ABI Prism 3700 DNA Analyzer. Sequencing primers specific for each end of the Entranceposon were used in the reactions (5'-ATCAGCGGCCGCGATCC-3' and 5'-TTATTCGGTCGAAAAGGATCC-3'). Sequencing of 5' and 3' cDNA ends was carried out as previously described [2]. The sequencing reported here was carried out over a 2-year period during which we made several major modifications to the strategy; for example, switching from sequencing mapped transposon insertions to random transposons. These changes improved throughput and cycle time, but made the process less efficient in terms of the required number of sequencing reads. Because of these changes, it is not possible to give a meaningful single efficiency estimate; however, our overall efficiency is comparable to other efforts using a similar strategy [8,9].

Data processing and assembly

cDNA clone data management relied on custom scripts and an Informix database. Sequences were processed using phred [37,38] and assembled using phrap [33]. 5' and 3' EST end-reads were combined with the transposon-based reads to generate cDNA clone assemblies. We adopted the sequence quality-control standards defined for the Mammalian Gene Collection project [39]. Custom scripts evaluated assemblies for: 5' and 3' EST reads in a single contig in the proper orientation; at least 10 bases of 3' poly(A) tail; phrap estimated error rate of less than one in 50,000 bases; and individual base quality of at least q25. Double-stranded coverage was not a criterion for a clone to be considered finished; however, we have determined that 96.2% of all submitted bases are double-stranded and 48% of clones had complete double-stranded coverage. Autofinish [10] was used to design primers to improve quality or extend sequence from multiple sequence contigs. cDNA clones with an estimated error rate greater than one in 50,000 bp were automatically identified and processed with additional rounds of Autofinish designed finishing work. If Autofinish could not design primers, custom primers were designed manually using consed. Custom scripts were used to manually order primers to generate a further round of sequencing.

The sequence data described in this paper have been submitted to the GenBank data library under accession numbers:

AF132140-AF132196, AF160900,

AF132551-AF132560, AF160903-AF160904,

AF132562-AF132563, AF160906, AF160909,

AF132565-AF132567, AF160911-AF160913,

AF145594-AF145621, AF160916-AF160917,

AF145623-AF145684, AF160921, AF160923,

AF145686-AF145696, AF160929,

AF160879, AF160882, AF160933-AF160934,

AF160889-AF160891, AF160938-AF160944,

AF160893-AF160897, AF160947,

AF172635-AF172637, AY071209-AY071211,

AF181622-AF181650, AY071213-AY071216,

AF181652-AF181657, AY071218-AY071250,

AF184224-AF184230, AY071252-AY071266,

AY047496-AY047580, AY071268-AY071288,

AY050225-AY050241, AY071290-AY071313,

AY051411-AY052150, AY071315-AY071320,

AY058243-AY058797, AY071322-AY071331,

AY059433-AY059459, AY071333-AY071342,

AY060222-AY060487, AY071345,

AY060595-AY061633, AY071347-AY071381,

AY061821-AY061834, AY071383-AY071385,

AY069026-AY069757, AY071387,

AY069759-AY069867, AY071389-AY071406,

AY070491-AY070597, AY071408-AY071436,

AY070599-AY070602, AY071438-AY071445,

AY070604-AY070608, AY071447-AY0 71450,

AY070610-AY070623, AY071452-AY071454,

AY070625-AY070628, AY071456-AY071461,

AY070632-AY070634, AY071463-AY071476,

AY070636, AY071478-AY071489,

AY070638-AY070642, AY071491,

AY070644, AY071494-AY071543,

AY070646-AY070651, AY071545-AY071557,

AY070653-AY070656, AY071559-AY071564,

AY070658-AY070662, AY071566-AY071577,

AY070664-AY070667, AY071579-AY071581,

AY070671-AY070692, AY071583-AY071606,

AY070694-AY070716, AY071608-AY071632,

AY070777-AY070805, AY071634-AY071661,

AY070807-AY070830, AY071663-AY071664,

AY070832-AY070909, AY071666-AY071672,

AY070911-AY070913, AY071674,

AY070915-AY070920, AY071681-AY071683,

AY070922-AY070951, AY071685-AY071692,

AY070953-AY070954, AY071694-AY071703,

AY070957-AY070964, AY071705-AY071711,

AY070966, AY071713-AY071721,

AY070969-AY070973, AY071724,

AY070975-AY070985, AY071726-AY071727,

AY070987-AY071000, AY071729-AY071731,

AY071002, AY071733-AY071741,

AY071004-AY071006, AY071743-AY071745,

AY071008-AY071056, AY071747-AY071764,

AY071058-AY071064, AY071767-AY071768,

AY071066-AY071072, AY075158-AY075228,

AY071074-AY071084, AY075230-AY075262,

AY071086-AY071090, AY075264-AY075441,

AY071092, AY075443-AY075451,

AY071094-AY07H36, AY075453-AY075473,

AY071138-AY071140, AY075475-AY075524,

AY071142-AY071154, AY075526-AY075588,

AY071156-AY071157, AY084089-AY084152,

AY071159-AY071197, AY084154-AY084214,

AY071199-AY071203, AY089215-AY089229,

AY071205-AY071207, AY089231-AY089329,

AY089331-AY089461, AY118273-AY118672,

AY089463-AY089564, AYn8674-AYn8713,

AY089566-AY089601, AY118715-AY119132,

AY089603-AY089615, AY119134-AY119287,

AY089617-AY089700, AY119441-AY119665,

AY094627-AY094871, AY121612-AY121684,

AY094873-AY094970, AY121686-AY121700,

AY094996-AY095100, AY121702-AY121717,

AY095172-AY095206, AY122061-AY122270,

AY095508-AY095533, AY128413-AY128506,

AY102649-AY102700, AY129431-AY129464,

AY113190-AY113653, BT001253-BT001904.

Analysis of finished cDNA sequences

cDNA sequence was submitted to GenBank with a preliminary annotation of the longest ORF and a gene assignment based on a high BLASTN similarity score to the Release 2 genome annotations. Subsequent processing was used to determine a more detailed analysis of the clone quality. Using BLASTN, sequence from each cDNA clone was compared to genomic sequence, predicted genes, predicted coding sequences (CDSs), known Drosophila transposable elements, and Escherichia coli transposable elements. Using BLASTP, the translation of the longest ORF was compared to the predicted Release 3 translations [15]. Custom scripts were used to parse the BLAST output and record similarity results. We also compared the nucleotide sequence of each clone to the Release 3 genome sequence [14] using Sim4 and to the Release 3 predicted CDS with the highest BLAST score.

mRNA editing

We confirmed the sequence quality of the genomic region encompassing CG018314 (12,731 bp) by independently assembling an 18,284 bp contig consisting solely of whole-genome shotgun (WGS) traces. The assembled sequence contig has an average of 8.6× sequence coverage. The phrap estimated error rate for each genomic base corresponding to a mRNA edited base is q90. Similarly, we determined the phrap estimated error rate for each mRNA edited base to be q90. We manually inspected chromatograms for high-quality discrepancies in the genomic sequence and found none, indicating that the edited bases are not due to population heterozygosity. To validate the editing sites, total RNA was isolated from heads from a mixed population of male and female adult flies from the isogenic strain y¹; cn¹ bw¹ sp¹ using the Concert™ Cytoplasmic RNA isolation reagent according to the manufacturer's guidelines (Invitrogen). Nine independent gene-specific RT-PCR reactions were performed using the Superscript™ one-step RT-PCR kit according to the manufacturer (Invitrogen) and PCR products were cloned into the PCR2.1 vector. Twenty-four independent subclones from each of four independent RT-PCR products were sequenced and twelve independent subclones from an additional five independent RT-PCR products were sequenced; we considered amplicons to represent independent transcripts if they arose from different RT-PCR reactions or if they differed in sequence. The gene-specific primers used in the RT-PCR experiments were 5'-GTGCAGACGAAAACGAGATGCCAATG-3' and 5'-TGTAGTTCTTCTCAAAGGGATTACG-3'.

References

Rubin GM, Hong L, Brokstein P, Evans-Holm M, Frise E, Stapleton M, Harvey DA: A Drosophila complementary DNA resource. Science. 2000, 287: 2222-2224. 10.1126/science.287.5461.2222.
Article PubMed CAS Google Scholar
Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K, et al: The Drosophila Gene Collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12: 1294-1300. 10.1101/gr.269102.
Article PubMed PubMed Central Google Scholar
Yu W, Andersson B, Worley KC, Muzny DM, Ding Y, Liu W, Ricafrente JY, Wentland MA, Lennon G, Gibbs RA: Large-scale concatenation cDNA sequencing. Genome Res. 1997, 7: 353-358.
PubMed CAS PubMed Central Google Scholar
Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, et al: Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res. 2001, 11: 422-435. 10.1101/gr.GR1547R.
Article PubMed CAS PubMed Central Google Scholar
Strathmann M, Hamilton BA, Mayeda CA, Simon MI, Meyerowitz EM, Palazzolo MJ: Transposon-facilitated DNA sequencing. Proc Natl Acad Sci USA. 1991, 88: 1247-1250. 10.1073/pnas.88.4.1247.
Article PubMed CAS PubMed Central Google Scholar
Devine SE, Boeke JD: Efficient integration of artificial transposons into plasmid targets in vitro: a useful tool for DNA mapping, sequencing and genetic analysis. Nucleic Acids Res. 1994, 22: 3765-3772. 10.1093/nar/22.18.3765.
Article PubMed CAS PubMed Central Google Scholar
Haapa S, Suomalainen S, Eerikainen S, Airaksinen M, Paulin L, Savilahti H: An efficient DNA sequencing strategy based on the bacteriophage mu in vitro DNA transposition reaction. Genome Res. 1999, 9: 308-315.
PubMed CAS PubMed Central Google Scholar
Shevchenko Y, Bouffard GG, Butterfield YS, Blakesley RW, Hartley JL, Young AC, Marra MA, Jones SJ, Touchman JW, Green ED: Systematic sequencing of cDNA clones using the transposon Tn5. Nucleic Acids Res. 2002, 30: 2469-2477. 10.1093/nar/30.11.2469.
Article PubMed CAS PubMed Central Google Scholar
Butterfield YS, Marra MA, Asano JK, Chan SY, Guin R, Krzywinski MI, Lee SS, MacDonald KW, Mathewson CA, Olson TE, et al: An efficient strategy for large-scale high-throughput transposon-mediated sequencing of cDNA clones. Nucleic Acids Res. 2002, 30: 2460-2468. 10.1093/nar/30.11.2460.
Article PubMed CAS PubMed Central Google Scholar
Gordon D, Desmarais C, Green P: Automated finishing with autofinish. Genome Res. 2001, 11: 614-625. 10.1101/gr.171401.
Article PubMed CAS PubMed Central Google Scholar
Walhout AJ, Boulton SJ, Vidal M: Yeast two-hybrid systems and protein interaction mapping projects for yeast and worm. Yeast. 2000, 17: 88-94. 10.1002/1097-0061(20000630)17:2<88::AID-YEA20>3.0.CO;2-Y.
Article PubMed CAS PubMed Central Google Scholar
Vidal M: A biological atlas of functional maps. Cell. 2001, 104: 333-339. 10.1016/S0092-8674(01)00221-5.
Article PubMed CAS Google Scholar
Zhu H, Snyder M: "Omic" approaches for unraveling signaling networks. Curr Opin Cell Biol. 2002, 14: 173-179. 10.1016/S0955-0674(02)00315-0.
Article PubMed CAS Google Scholar
Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, et al: Finishing a whole shotgun sequence assembly: Release 3 of the Drosophila euchromatic sequence. Genome Biol. 2002, 3 (12): research0079.1-0079.14. 10.1186/gb-2002-3-12-research0079.
Article Google Scholar
Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell K, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome Biol. 2002, 3: research0083.1-0083.22. 10.1186/gb-2002-3-12-research0083.
Article Google Scholar
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Article PubMed CAS PubMed Central Google Scholar
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
PubMed CAS PubMed Central Google Scholar
Powell JR: Progress and prospects in evolutionary biology: the Drosophila model. New York: Oxford University Press. 1997, 344-420.
Google Scholar
Shendure J, Church GM: Computational discovery of sense-antisense transcription in the human and mouse genomes. Genome Biol. 2002, 3: research0044.1-0044.14. 10.1186/gb-2002-3-9-research0044.
Article Google Scholar
Kaminker JS, Bergman C, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DL, Lewis SE, Rubin GM, et al: The transposable elements of the Drosophila melanogaster euchromatin - a genomics perspective. Genome Biol. 2002, 3: research0084.1-0084.20. 10.1186/gb-2002-3-12-research0084.
Article Google Scholar
Walhout AJ, Temple GF, Brasch MA, Hartley JL, Lorson MA, van den Heuvel S, Vidal M: GATEWAY recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes. Methods Enzymol. 2000, 328: 575-592. 10.1016/S0076-6879(00)28419-X.
Article PubMed CAS Google Scholar
BD Creator™ Gene Expression Systems. [http://www.clontech.com/products/families/creator/index.shtml]
Orbigen. [http://www.orbigen.com]
Bass BL: RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem. 2002, 71: 817-846. 10.1146/annurev.biochem.71.110601.135501.
Article PubMed CAS PubMed Central Google Scholar
Palladino MJ, Keegan LP, O'Connell MA, Reenan RA: A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity. Cell. 2000, 102: 437-449. 10.1016/S0092-8674(00)00049-0.
Article PubMed CAS Google Scholar
Ma E, Tucker M, Chen Q, Haddad G: Developmental expression and enzymatic activity of pre-mRNA deaminase in Drosophila melanogaster. Brain Res Mol Brain Res. 2002, 102: 100-104. 10.1016/S0169-328X(02)00186-9.
Article PubMed CAS Google Scholar
Smith LA, Peixoto AA, Hall JC: RNA editing in the Drosophila DMCA1A calcium-channel α1 subunit transcript. J Neurogenet. 1998, 12: 227-240. 10.3109/01677069809108560.
Article PubMed CAS Google Scholar
Reenan RA, Hanrahan CJ, Barry G: The mle^naptsRNA helicase mutation in Drosophila results in a splicing catastrophe of the para Na⁺ channel transcript in a region of RNA editing. Neuron. 2000, 25: 139-149. 10.1016/S0896-6273(00)80878-8.
Article PubMed CAS Google Scholar
Semenov EP, Pak WL, Related Articles, Links: Diversification of Drosophila chloride channel gene by multiple posttranscriptional mRNA modifications. J Neurochem. 1999, 72: 66-72. 10.1046/j.1471-4159.1999.0720066.x.
Article PubMed CAS Google Scholar
Allen JM, Baetge EE, Abrass IB, Palmiter RD: Isoproterenol response following transfection of the mouse β₂-adrenergic receptor gene into Y1 cells. EMBO J. 1988, 7: 133-138.
PubMed CAS PubMed Central Google Scholar
Kobilka BK, Dixon RA, Frielle T, Dohlman HG, Bolanowski MA, Sigal IS, Yang-Feng TL, Francke U, Caron MG, Lefkowitz RJ: cDNA for the human-adrenergic receptor: a protein with multiple β₂ membrane-spanning domains and encoded by a gene whose chromosomal location is shared with that of the receptor for platelet-derived growth factor. Proc Natl Acad Sci USA. 1987, 84: 46-50. 10.1073/pnas.84.1.46.
Article PubMed CAS PubMed Central Google Scholar
cDNA production process flow. [http://www.fruitfly.org/DGC/FLSWorkflow.html]
The Phred/Phrap/Consed system home page. [http://www.phrap.org]
Kimmel B, Palazzolo MJ, Martin CH, Boeke JD, Devine SE: Transposon-mediated DNA sequencing. In Genome Analysis: a laboratory manual. Analyzing DNA. Edited by: Birren B, Green ED, Klapholz S, Myers RM, Roskams J. 1997, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 455-532.
Google Scholar
Veklerov E, Martin CH, Theil EH: TRAMP: a software package for generating transposon maps. Comput Appl Biosci. 1995, 11: 173-179.
PubMed CAS Google Scholar
Cawley SE, Speed TP: DNA sequencing with transposons. J Comput Biol. 2000, 7: 717-729. 10.1089/106652701446161.
Article PubMed CAS Google Scholar
Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.
Article PubMed CAS Google Scholar
Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.
Article PubMed CAS Google Scholar
Mammalian Gene Collection. [http://mgc.nci.nih.gov]

Download references

Acknowledgements

We thank Catherine Nelson for critically reading and improving the manuscript, Sandeep Patel for help during the manual finishing phase of cDNA sequencing, Michelle Chew for excellent technical assistance in the early stages of this project, and the entire staff of the BDGP sequencing center. We also thank the FlyBase curators for their help in identifying the DGCr3, and Erwin Frise and Eric Smith for system administration. This work was supported by NIH Grant P50-HG00750 (GMR), Department of Energy Grant nos. DE-FG03-98ER62625 and DE-FG03-99ER62739 (GMR), and performed under Department of Energy Contract DE-AC0376SF00098, University of California.

Author information

Peter Brokstein
Present address: Incyte Genomics, 3160 Porter Drive, Palo Alto, CA, 94304, USA

Authors and Affiliations

Berkeley Drosophila Genome Project Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Mark Stapleton, Joe Carlson, Peter Brokstein, Charles Yu, Mark Champe, Reed George, Hannibal Guarin, Brent Kronmiller, Joanne Pacleb, Soo Park, Ken Wan, Gerald M Rubin & Susan E Celniker
Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Mark Stapleton, Joe Carlson, Peter Brokstein, Charles Yu, Mark Champe, Reed George, Hannibal Guarin, Brent Kronmiller, Joanne Pacleb, Soo Park, Ken Wan & Susan E Celniker
Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
Gerald M Rubin
Howard Hughes Medical Institute, University of California, Berkeley, CA, 94720, USA
Gerald M Rubin
Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA, 94404, USA
Mark Champe
Department of Bioinformatics and Computational Biology, Iowa State University, Ames, IO, 50011, USA
Brent Kronmiller

Authors

Mark Stapleton
View author publications
You can also search for this author in PubMed Google Scholar
Joe Carlson
View author publications
You can also search for this author in PubMed Google Scholar
Peter Brokstein
View author publications
You can also search for this author in PubMed Google Scholar
Charles Yu
View author publications
You can also search for this author in PubMed Google Scholar
Mark Champe
View author publications
You can also search for this author in PubMed Google Scholar
Reed George
View author publications
You can also search for this author in PubMed Google Scholar
Hannibal Guarin
View author publications
You can also search for this author in PubMed Google Scholar
Brent Kronmiller
View author publications
You can also search for this author in PubMed Google Scholar
Joanne Pacleb
View author publications
You can also search for this author in PubMed Google Scholar
Soo Park
View author publications
You can also search for this author in PubMed Google Scholar
Ken Wan
View author publications
You can also search for this author in PubMed Google Scholar
Gerald M Rubin
View author publications
You can also search for this author in PubMed Google Scholar
Susan E Celniker
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mark Stapleton.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Stapleton, M., Carlson, J., Brokstein, P. et al. A Drosophilafull-length cDNA resource. Genome Biol 3, research0080.1 (2002). https://doi.org/10.1186/gb-2002-3-12-research0080

Download citation

Received: 21 October 2002
Revised: 27 November 2002
Accepted: 27 November 2002
Published: 23 December 2002
DOI: https://doi.org/10.1186/gb-2002-3-12-research0080

A Drosophilafull-length cDNA resource

Abstract

Background

Results

Conclusions

Background

Results and discussion

Sequencing strategy

Evaluating the coding potential of each cDNA on the basis of its full-insert sequence

Improving the DrosophilacDNA resource

mRNA editing

Materials and methods

Sequencing strategy

In vitrotransposition and mapping insertion sites

DNA sequencing

Data processing and assembly

Analysis of finished cDNA sequences

mRNA editing

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Genome Biology

Contact us

A Drosophilafull-length cDNA resource

Abstract

Background

Results

Conclusions

Background

Results and discussion

Sequencing strategy

Evaluating the coding potential of each cDNA on the basis of its full-insert sequence

Improving the DrosophilacDNA resource

mRNA editing

Materials and methods

Sequencing strategy

In vitrotransposition and mapping insertion sites

DNA sequencing

Data processing and assembly

Analysis of finished cDNA sequences

mRNA editing

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us