Full-length cDNAs from chicken bursal lymphocytes to facilitate gene function analysis
- Randolph B Caldwell†1,
- Andrzej M Kierzek†2, 3,
- Hiroshi Arakawa1,
- Yuri Bezzubov1,
- Jolanta Zaim2,
- Petra Fiedler1,
- Stefan Kutter1,
- Artem Blagodatski1,
- Diyana Kostovska1,
- Marek Koter1,
- Jiri Plachy4,
- Piero Carninci5, 6,
- Yoshihide Hayashizaki5, 6 and
- Jean-Marie Buerstedde1
© Caldwell et al.; licensee BioMed Central Ltd. 2004
Received: 7 September 2004
Accepted: 7 December 2004
Published: 23 December 2004
A large number of cDNA inserts were sequenced from a high-quality library of chicken bursal lymphocyte cDNAs. Comparisons to public gene databases indicate that the cDNA collection represents more than 2,000 new, full-length transcripts. This resource defines the structure and the coding potential of a large fraction of B-cell specific and housekeeping genes whose function can be analyzed by disruption in the chicken DT40 B-cell line.
Large-scale genomic and cDNA sequencing projects have revealed thousands of new genes whose open reading frames (ORFs) are highly conserved during vertebrate evolution, but whose precise cellular functions remain unclear. Although functional analysis by gene disruption is possible after transfection of murine embryonic stem cells and the breeding of knockout mice, these whole-animal studies are laborious and expensive. If the mutant phenotype can be distinguished in cell culture, the chicken B-cell line DT40 is a valid alternative to murine knockouts because of its high ratio of targeted to random gene integration [2–4]. Additional advantages of DT40 are tightly regulated conditional gene-expression systems for the analysis of essential genes [5–7] and the ability to study genetic interactions by the stepwise modification of multiple loci and marker recycling.
The recent release of the chicken genome sequence greatly benefits the DT40 research community. For the first time, the entire genome can be searched for sequences that are conserved during vertebrate evolution and whose function might be clarified after genetic modification in DT40. However, in silico gene structure prediction methods have a high error rate and often do not correctly annotate the intron-exon structure of genes. Only full-length cDNAs unambiguously define the boundaries of the transcription units within whole-genome assemblies. Cloned full-length cDNAs are also of immense practical value for complementing mutant phenotypes and artificially expressing the encoded protein. For these reasons, many genome sequencing projects in higher eukaryotes have been complemented by large-scale efforts to obtain a maximum number of full-length cDNAs [11, 12]. Although relatively large expressed sequence tag (EST) databases from bursal lymphocytes and other tissues have been described, relatively few chicken cDNA sequences had been deposited in the public databases.
Here we describe a project to sequence and characterize a large number of full-length cDNAs from bursal lymphocytes. The corresponding genes are likely to be expressed in DT40 and this should facilitate their analysis by targeted gene modifications. In combination with the recently released cDNAs from other tissues, the bursal cDNAs will be a valuable resource for many laboratories working with the chicken as a model organism.
Results and discussion
Generation of bursal cDNA sequences
The plasmids corresponding to the remaining 2,796 ESTs were chosen for full insert sequencing by bidirectional primer walks. Once the ends of the walks had been reached, the sequences of the full-length cDNA inserts were assembled. The most likely methionine start codon was assigned to each sequence from the BLAST search results. About 15% of the cDNA sequences showed evidence of premature frameshifts in the form of short ORFs and stretches of conserved sequence in a different frame further 3'. If overlapping ESTs were found in the public databases, the cDNA sequences were edited to correct the likely reverse transcription error; otherwise these sequences were discarded.
Length distribution and GC content
The most remarkable feature noted in the analysis of 5' UTRs of the bursal cDNAs is a very high GC content (67%). This supports the observation that the GC content of 5' UTRs is particularly high in warm-blooded species. On the other hand, the percentage of GC base-pairs in 3' UTRs of the bursal cDNAs (41%) is close to the value observed for database sequences (42%). The ORFs of the bursal full-length cDNAs contain 49% GC base-pairs.
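As a rough illustration of how region-wise GC percentages of this kind could be computed, the sketch below splits one cDNA into 5' UTR, ORF and 3' UTR using annotated coding-sequence coordinates. The function names and the 0-based, end-exclusive coordinate convention are assumptions made for this example and are not taken from the authors' pipeline.

```python
from typing import Dict, Optional


def gc_percent(seq: str) -> Optional[float]:
    """Percentage of G+C among unambiguous bases, or None if there are none."""
    s = seq.upper().replace("U", "T")
    counted = sum(s.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return 100.0 * (s.count("G") + s.count("C")) / counted


def region_gc(cdna: str, cds_start: int, cds_end: int) -> Dict[str, Optional[float]]:
    """GC percentage of the 5' UTR, ORF and 3' UTR of a single cDNA.

    cds_start is the 0-based index of the A of the start codon;
    cds_end is the index one past the last base of the stop codon
    (both conventions are assumptions of this sketch).
    """
    if not 0 <= cds_start <= cds_end <= len(cdna):
        raise ValueError("coding-sequence coordinates lie outside the cDNA")
    return {
        "5utr": gc_percent(cdna[:cds_start]),
        "orf": gc_percent(cdna[cds_start:cds_end]),
        "3utr": gc_percent(cdna[cds_end:]),
    }
```

Collection-wide values would then be obtained by pooling the bases of each region over all cDNAs, or by averaging per-cDNA values; which of the two the reported percentages correspond to is not stated here.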
Analysis of start codon context
The accurate prediction of the translation start codon remains difficult and in some cases our annotations remain tentative. Sequences surrounding the translation start codons are not random and in mammals match the consensus GCCRCCaugG (where aug is the start codon and R is either A or G). The most conserved nucleotides in the consensus are a purine, usually A, at position -3 and G at position +4. It has also been observed that a large fraction of 5' UTRs contain AUG codons upstream of the translation start site, but these codons are unlikely to be flanked by the consensus sequence.
A detailed analysis shows that the riken1 collection of cDNA sequences contains 4,406 AUG codons upstream of the annotated translation start codons in 2,218 of the bursal cDNAs. Of these alternative start codons, 901 were in the same reading frame as the annotated ORF, and an in-frame stop codon within the 5' UTR was present downstream of 501 of these 901. The total number of ORFs present in the 5' UTRs of riken1 cDNAs was 1,289.
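The two most conserved consensus positions can be checked programmatically. The following is a minimal sketch, assuming the A of the start codon is numbered +1 so that the purine sits three bases upstream and the G directly follows the codon. The function name and the three labels ("strong", "adequate", "weak") are illustrative choices for this example, not part of the analysis described here.

```python
def kozak_class(mrna: str, aug_index: int) -> str:
    """Classify the context of a start codon by its -3 and +4 positions.

    aug_index is the 0-based index of the A of the AUG/ATG.
    Returns "strong" (purine at -3 and G at +4), "adequate" (only one
    of the two) or "weak" (neither). Labels are illustrative.
    """
    s = mrna.upper().replace("U", "T")
    if aug_index < 0 or s[aug_index:aug_index + 3] != "ATG":
        raise ValueError("no ATG/AUG at the given index")
    # Position -3 is three bases upstream of the A; +4 follows the G of AUG.
    minus3 = s[aug_index - 3] if aug_index >= 3 else ""
    plus4 = s[aug_index + 3] if aug_index + 3 < len(s) else ""
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[hits]
```

For example, the consensus itself, `kozak_class("GCCACCAUGG", 6)`, is classified as "strong".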
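The counts above rest on three properties of each upstream AUG: its position, whether it shares the reading frame of the annotated ORF, and whether a stop codon in its own frame occurs before the annotated start. A minimal sketch of such a scan is given below; the names `UpstreamAUG` and `upstream_augs` and the 0-based coordinates are assumptions for illustration only.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class UpstreamAUG(NamedTuple):
    position: int            # 0-based index of the A of the upstream ATG
    in_frame_with_cds: bool  # same reading frame as the annotated start codon
    stop_in_utr: bool        # stop codon in the ATG's frame, fully inside the 5' UTR


def upstream_augs(cdna: str, cds_start: int) -> List[UpstreamAUG]:
    """List every ATG that lies completely within the 5' UTR."""
    s = cdna.upper().replace("U", "T")
    found = []
    for p in range(0, cds_start - 2):
        if s[p:p + 3] != "ATG":
            continue
        in_frame = (cds_start - p) % 3 == 0
        # Walk codon by codon from the ATG, staying inside the 5' UTR.
        stop = any(s[j:j + 3] in STOP_CODONS
                   for j in range(p + 3, cds_start - 2, 3))
        found.append(UpstreamAUG(p, in_frame, stop))
    return found
```

In this sketch, an upstream AUG with `stop_in_utr` set corresponds to an ORF contained entirely in the 5' UTR, whereas an in-frame AUG without such a stop would extend the annotated ORF at its amino terminus.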
Similarity to predicted Ensembl transcripts and UniProt protein sequences
Table. Comparison of riken1 cDNAs and cDNAs predicted by Ensembl (values not shown). Columns: number of riken1 cDNAs; percentage of the total number of riken1 cDNAs. Rows: more than 90% identity and more than 90% coverage; more than 90% identity and more than 70% coverage; more than 90% identity and more than 50% coverage; less than 90% identity; 3' or 5' UTR of riken1 cDNA uncovered; 5' UTR of riken1 cDNA uncovered.
Figure 4b shows the distribution of the percent identity and coverage statistics of the BLASTP comparison of the proteins encoded by the bursal cDNAs to the UniProt collection of protein sequences. In most cases (1,524), the proteins encoded by riken1 cDNAs were almost fully covered in the alignments (more than 90% coverage) and showed a high percentage identity (greater than 70%) to known protein sequences.
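Percent identity and coverage can both be derived from the basic fields of a pairwise alignment. The sketch below assumes one common definition, in which coverage is the aligned span of the query divided by the query length; the exact definition used for the statistics above is not given here, and the `Hsp` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Minimal description of one alignment (hypothetical field names)."""
    identities: int   # number of identical aligned positions
    align_len: int    # alignment length including gaps
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    query_len: int    # full length of the query sequence


def percent_identity(h: Hsp) -> float:
    return 100.0 * h.identities / h.align_len


def percent_coverage(h: Hsp) -> float:
    return 100.0 * (h.query_end - h.query_start + 1) / h.query_len


def meets(h: Hsp, min_identity: float, min_coverage: float) -> bool:
    """True if both statistics strictly exceed the given thresholds."""
    return percent_identity(h) > min_identity and percent_coverage(h) > min_coverage
```

With these definitions, `meets(hsp, 70, 90)` would express the "more than 90% coverage and greater than 70% identity" criterion quoted above.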
When compared to available chicken ESTs or cDNAs in the public databases, some of the bursal cDNAs showed significant structural differences, most likely due to differential transcript processing. In addition, the bursal cDNA collection has been used to define a large number of intragenic single-nucleotide polymorphisms (SNPs).
Functional domain assignment
Table. The 10 most frequently occurring Pfam domains in chicken full-length cDNAs, with the number of occurrences of each (values not shown). Domains listed: protein kinase domain; RRM_1, RNA recognition motif; Arf, ADP-ribosylation factor family; PH, pleckstrin homology domain; Helicase_C, helicase conserved C-terminal domain; SH3 (Src homology 3) domain; BTB or POZ domain present in some zinc finger proteins; DEAD, DEAD/DEAH box helicase; AAA, ATPase family associated with various cellular activities.
Table. The 10 molecular function GO terms most frequently assigned to chicken cDNAs, with the number of occurrences of each (values not shown). Terms listed: protein kinase activity; molecular function unknown; transcription factor activity; protein transporter activity; zinc ion binding.
Table. The 10 biological process GO terms most frequently assigned to chicken cDNAs, with the number of occurrences of each (values not shown). Terms listed: protein amino acid phosphorylation; regulation of transcription, DNA-dependent; proteolysis and peptidolysis; small GTPase mediated signal transduction; intracellular protein transport.
There are 22 full-length cDNAs in our collection containing Pfam domains annotated by the GO term 'molecular function unknown'. Experimental information concerning the molecular mechanisms of action is very sparse or nonexistent for proteins sharing these evolutionarily conserved domains. Highly similar human proteins exist for the chicken proteins, an example being the human protein BM02. Taking into account the ease of targeted genome modification and availability of numerous functional assays, the DT40 cell line is an attractive model system to provide first insights into the functions of the evolutionarily conserved domains described above.
Bursal Transcript database
All the full-length cDNA sequences are stored within the Bursal Transcript database. This database links the previously published EST data with the new cDNAs and can be searched by keyword or by using BLAST. Functional categories can also be browsed, because dynamically generated web pages link the bursal cDNAs to Ensembl, UniProt, Pfam and GO data. To highlight gene expression differences between DT40 and bursal cells, the bursal cDNAs are also linked to SAGE data from both of these cell types.
The cDNAs from bursal lymphocytes represent one of the largest full-length cDNA collections in the chicken, comprising about one third of all currently available, experimentally verified transcripts. They will be of general interest to researchers using the chicken as an experimental model as well as to the poultry industry. The resource has already been integrated with the chicken genome sequences to build a unigene catalog, to define the nature and frequency of intragenic chicken strain polymorphisms and to develop a chicken gene microarray for gene-expression profiling (B. Wong, T. Makeev and C. Davies, unpublished data). However, the main beneficiary of the full-length cDNAs is the DT40 research community. Although the release of the genome sequence has greatly simplified the identification of candidate genes for disruption and the design of the knockout constructs, it is still not a trivial task to predict the ORFs as well as 3' and 5' UTRs without cDNA sequences. The cDNAs can also be expressed in vitro or used to complement mutant DT40 phenotypes, with the added convenience that the cDNA sequences are not only known but also available as cloned, fully sequenced DNA.
Materials and methods
Construction of the riken1 cDNA library and 5' EST sequencing
The riken1 library was synthesized from mRNA of 2-week-old CB strain bursal lymphocytes using the biotinylated cap trapper method [16, 31]. The resulting phage library was converted into pKS-derived plasmids and individual clones were then selected on ampicillin-containing agarose plates. About 45,000 colonies were picked and transferred into 384-well microtiter plates to prepare a permanent clone stock. Plasmids from 14,976 of the arrayed clones were sequenced on an Applied Biosystems automated sequencer using a primer that anneals to the plasmid backbone upstream of the 5' end of the cDNA inserts (details of the cloning vector sequence are given elsewhere). The ABI sequencing files were processed as described previously. About 5% of the riken1 clones contained an insert sequence that was 100% identical to the GenBank entry AJ277662, annotated as a human genomic fragment including the LMO1 locus. This sequence was present as a stuffer of the lambda vector used for the library construction, and the clones containing it were removed from further analysis. In total, the 5' single-pass sequencing of 14,976 clones yielded 11,116 high-quality ESTs of the riken1 library.
Selection of clones for full-length insert sequencing
BLAST searches against the 'All non-redundant GenBank CDS' database showed that approximately 80% of the 5' EST sequences matched GenBank entries with a score of at least 50. This score threshold was chosen because it allowed us in most cases to align the putative start codon of the matched database entry to the EST. These sequences were chosen and clustered to remove duplicates. In addition, all sequences matching chicken entries in the public databases with a score of over 300 were excluded from further consideration. The BLAST results of all remaining sequences were manually inspected and only those sequences that covered the methionine start codon of their closest match in the public databases were retained. In the end, the cDNA inserts corresponding to 2,796 ESTs were chosen for full insert sequencing.
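The selection criteria above amount to a simple filter over per-EST search results. The sketch below illustrates that logic only: the `EstHit` record and its fields are hypothetical, the clustering step is omitted, and the manual inspection is represented by a precomputed boolean flag.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class EstHit:
    """Summary of BLAST results for one 5' EST (hypothetical structure)."""
    est_id: str
    best_cds_score: float      # best score against the GenBank CDS database
    best_chicken_score: float  # best score against chicken entries, 0 if none
    covers_start_codon: bool   # outcome of the manual inspection


def select_for_full_sequencing(hits: Iterable[EstHit],
                               min_score: float = 50.0,
                               max_chicken_score: float = 300.0) -> List[str]:
    """Return the IDs of ESTs that pass all three criteria."""
    return [
        h.est_id
        for h in hits
        if h.best_cds_score >= min_score                # recognizable homolog
        and h.best_chicken_score <= max_chicken_score   # not already known in chicken
        and h.covers_start_codon                        # insert likely full-length at 5' end
    ]
```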
Full-length insert sequencing
Sufficient plasmid template for numerous sequencing reactions was prepared from the clones corresponding to the selected ESTs. All plasmids were then sequenced with a primer complementary to a plasmid sequence 3' of the cDNA insertion site. Subsequently, custom-made 20-mer primers based on available sequences were used for sequencing until the 3' and 5' ends of the cDNA inserts were reached. All sequences were processed as described previously, except that a routine of manual proofreading and editing of the chromatograms within the Staden pregap program was implemented to increase the quality of the base calling and to decrease the failure rate of the next primer walk. The FOUNTAIN software was extended to automatically design primers in 96-well format suitable for these walks. The primers were positioned to give an average overlap of 70 bp between sequences. Once both ends of the insert were reached by the primer walks, the Staden gap4 program was used to produce a double-stranded consensus of the cDNA insert. A total of 2,565 high-quality cDNA contigs were assembled for further analysis.
Quality check and correction of frameshifts in the cDNA sequences
The integrity of the conserved ORF within each assembled cDNA sequence was manually examined by inspecting BLAST search results against the public protein and EST databases. To facilitate this task, a new EstSet module was added to FOUNTAIN. The user interface displays the sequence of the cDNA insert together with its three possible translations and its BLAST search results against the public protein and EST databases. On the basis of this information a likely methionine start codon can be assigned to the cDNA. Around 15% of the cDNA sequences showed evidence of an artificial frameshift in the form of suspicious BLAST matches in two or more ORFs, presumably due to errors in the reverse transcription process. These sequences were compared to other Gallus gallus ESTs from the public databases. If the short ORF could be corrected by adopting the sequence of an overlapping EST, the cDNA sequence was edited. The type of editing was recorded and the corresponding riken1 clone was annotated as likely to be defective. In total, 293 cDNAs were removed for one of three reasons: a likely artificial frameshift could not be corrected by using sequences of overlapping ESTs, the cDNA contained multiple stop codons in all three reading frames, or it showed evidence of unspliced introns. All cDNA clones are freely available upon request to the corresponding author and their sequences have been submitted to the EMBL public database (accession numbers AJ719267-AJ721138 and AJ851370-AJ851825).
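The three forward translations that such an interface displays, and the "matches in two or more frames" criterion, can be sketched as follows. This is an illustrative stand-in rather than the EstSet module itself; it uses the standard genetic code, and the function names are hypothetical.

```python
from itertools import product
from typing import Iterable, List

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int) -> str:
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    s = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(s[i:i + 3], "X")
        for i in range(frame, len(s) - 2, 3)
    )


def three_frame_translation(seq: str) -> List[str]:
    return [translate(seq, frame) for frame in range(3)]


def frameshift_suspected(frames_with_hits: Iterable[int]) -> bool:
    """Flag a cDNA whose significant protein matches fall in two or more frames."""
    return len(set(frames_with_hits)) >= 2
```

For a cDNA whose 5' part matches a known protein in frame 0 and whose 3' part matches the same protein in frame 2, `frameshift_suspected([0, 2])` returns `True`, which would mark the clone for comparison with overlapping ESTs.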
Analysis of the start codon context
The sequences surrounding the annotated start codons (10 bp upstream and downstream) were exported and submitted for information content visualization by the WebLogo software. Subsequently, we exported the ±10 bp context of every ATG codon located upstream of the annotated start of the coding sequence. These sequences were also analyzed with the WebLogo software.
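The quantity displayed in a sequence logo is the information content of each alignment column. The sketch below extracts a fixed-width context around a start codon and computes the per-column information as 2 bits minus the Shannon entropy. It is a simplified stand-in for a logo generator: it applies no small-sample correction, and the function names are assumptions for this example.

```python
import math
from collections import Counter
from typing import List, Optional


def start_context(cdna: str, atg_index: int, flank: int = 10) -> Optional[str]:
    """Return the ATG with `flank` bases on each side, or None if the cDNA is too short."""
    begin, end = atg_index - flank, atg_index + 3 + flank
    if begin < 0 or end > len(cdna):
        return None
    return cdna[begin:end].upper()


def information_content(aligned: List[str]) -> List[float]:
    """Per-column information in bits for equal-length DNA sequences."""
    if not aligned:
        raise ValueError("no sequences given")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("sequences must have equal length")
    bits = []
    for column in zip(*aligned):
        total = len(column)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(column).values()
        )
        bits.append(2.0 - entropy)  # 2 bits is the maximum for four bases
    return bits
```

A fully conserved column, such as each base of the ATG itself, scores 2 bits, whereas a column with all four bases equally frequent scores 0 bits.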
Sequence-similarity searches and functional class annotation
The riken1 cDNAs were compared with the collection of predicted chicken transcripts downloaded from the Ensembl ftp site using the BLASTN program. BLASTP software was used to compare translated ORFs with the protein sequences stored in the UniProt database. Functional domains were assigned by comparing riken1 cDNAs with sequence profiles representing Pfam domains. This comparison was performed with RPSBLAST software (e-value cut-off of 10^-6) run on the binary database files downloaded from the National Center for Biotechnology Information (NCBI). Functional classes were assigned according to the Pfam-to-GO mapping provided by the InterPro database. The XML information exchange standard was used to interface the BLAST program outputs with the FOUNTAIN package.
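The final annotation step combines an e-value filter on domain hits with a lookup in a Pfam-to-GO mapping. The sketch below assumes mapping lines of the form `Pfam:PF00069 Pkinase > GO:protein kinase activity ; GO:0004672` with comment lines starting with `!`, and assumes that domain hits are already available as (Pfam accession, e-value) pairs. Both the line layout and the hit representation are assumptions of this example and should be checked against the files actually used.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed layout: "Pfam:<accession> <name> > GO:<term name> ; GO:<id>"
PFAM2GO_LINE = re.compile(r"^Pfam:(PF\d+)\s+\S+\s+>\s+GO:(.+?)\s*;\s*(GO:\d+)\s*$")


def parse_pfam2go(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each Pfam accession to a list of (GO id, GO term name) pairs."""
    mapping: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        if line.startswith("!"):  # comment or header line
            continue
        match = PFAM2GO_LINE.match(line.strip())
        if match:
            pfam_id, term_name, go_id = match.groups()
            mapping[pfam_id].append((go_id, term_name))
    return dict(mapping)


def go_terms_for_hits(hits: Iterable[Tuple[str, float]],
                      pfam2go: Dict[str, List[Tuple[str, str]]],
                      max_evalue: float = 1e-6) -> Set[str]:
    """Collect GO ids for all domain hits at or below the e-value cut-off."""
    terms: Set[str] = set()
    for pfam_id, evalue in hits:
        if evalue <= max_evalue:
            terms.update(go_id for go_id, _ in pfam2go.get(pfam_id, []))
    return terms
```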
This work was supported by the EU grants 'Chicken IMAGE', 'Genetics in a cell line' (QLK3-2000-00785) and 'Mechanisms of gene integration' (LSHG-CT-2003-503303).
1. Smithies O: Animal models of human genetic diseases. Trends Genet. 1993, 9: 112-116. 10.1016/0168-9525(93)90204-U.
2. Buerstedde JM, Takeda S: Increased ratio of targeted to random integration after transfection of chicken B cell lines. Cell. 1991, 67: 179-188. 10.1016/0092-8674(91)90581-I.
3. Kurosaki T: Genetic analysis of B cell antigen receptor signalling. Annu Rev Immunol. 1999, 17: 555-592. 10.1146/annurev.immunol.17.1.555.
4. Arakawa H, Buerstedde JM: Immunoglobulin gene conversion: insights from bursal B cells and the DT40 cell line. Dev Dyn. 2004, 229: 458-464. 10.1002/dvdy.10495.
5. Wang J, Takagaki Y, Manley JL: Targeted disruption of an essential vertebrate gene: ASF/SF2 is required for cell viability. Genes Dev. 1996, 10: 2588-2599.
6. Fukagawa T, Brown WR: Efficient conditional mutation of the vertebrate CENP C gene. Hum Mol Genet. 1997, 6: 2301-2308. 10.1093/hmg/6.13.2301.
7. Arakawa H, Lodygin D, Buerstedde JM: Mutant lox vectors for selectable marker recycle and conditional knock-outs. BMC Biotechnol. 2001, 1: 7. 10.1186/1472-6750-1-7.
8. Takata M, Sasaki MS, Sonoda E, Morrison C, Hashimoto M, Utsumi H, Yamaguchi-Iwai Y, Shinohara A, Takeda S: Homologous recombination and non-homologous end-joining pathways of DNA double-strand break repair have overlapping roles in the maintenance of chromosomal integrity in vertebrate cells. EMBO J. 1998, 17: 5497-5508. 10.1093/emboj/17.18.5497.
9. International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716. 10.1038/nature03154.
10. The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I and II Team: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002, 420: 563-573. 10.1038/nature01266.
11. Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, et al: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004, 2: E162. 10.1371/journal.pbio.0020162.
12. Baldarelli RM, Hill DP, Blake JA, Adachi J, Furuno M, Bradt D, Corbani LE, Cousins S, Frazer KS, Qi D, et al: Connecting sequence and biology in the laboratory mouse. Genome Res. 2003, 13: 1505-1519. 10.1101/gr.991003.
13. Abdrakhmanov I, Lodygin D, Geroth P, Arakawa H, Law A, Plachy J, Korn B, Buerstedde JM: A large database of chicken bursal ESTs as a resource for the analysis of vertebrate gene function. Genome Res. 2000, 10: 2062-2069. 10.1101/gr.10.12.2062.
14. Boardman PE, Sanz-Ezquerro J, Overton IM, Burt DW, Bosch E, Fong WT, Tickle C, Brown WR, Wilson SA, Hubbard SJ: A comprehensive collection of chicken cDNAs. Curr Biol. 2002, 12: 1965-1969. 10.1016/S0960-9822(02)01296-4.
15. Hubbard SJ, Grafham DV, Beattie KJ, Overton IM, McLaren SR, Croning MDR, Boardman PE, Bonfield JK, Burnside J, Davies RM, et al: Transcriptome analysis for the chicken based on 19,626 finished cDNA sequences and 485,337 expressed sequence tags. Genome Res. 2004, 10.1101/gr.3011405.
16. Carninci P, Shibata Y, Hayatsu N, Sugahara Y, Shibata K, Itoh M, Konno H, Okazaki Y, Hayashizaki Y: Normalization and subtraction of cap-trapper-selected cDNAs to prepare full-length cDNA libraries for rapid discovery of new genes. Genome Res. 2000, 10: 1617-1630. 10.1101/gr.145100.
17. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
18. Bursal Transcript Database. [http://pheasant.gsf.de/DEPARTMENT/DT40/dt40Transcript.html]
19. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004, 32 (Database issue): D115-D119. 10.1093/nar/gkh131.
20. Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, Gissi C, Saccone C: UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res. 2002, 30: 335-340. 10.1093/nar/30.1.335.
21. Mignone F, Gissi C, Liuni S, Pesole G: Untranslated regions of mRNAs. Genome Biol. 2002, 3: reviews0004. 10.1186/gb-2002-3-3-reviews0004.
22. Pesole G, Liuni S, Grillo G, Saccone C: Structural and compositional features of untranslated regions of eukaryotic mRNAs. Gene. 1997, 205: 95-102. 10.1016/S0378-1119(97)00407-1.
23. Kozak M: Pushing the limits of the scanning mechanism for initiation of translation. Gene. 2002, 299: 1-34. 10.1016/S0378-1119(02)01056-9.
24. Crooks GE, Hon G, Chandonia JM: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
25. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, et al: An overview of Ensembl. Genome Res. 2004, 14: 925-928.
26. Wahl MB, Caldwell RB, Kierzek AM, Arakawa H, Eyras E, Hubner N, Jung C, Soeldenwagner M, Cervelli M, Wang YD, et al: Evaluation of the chicken transcriptome by SAGE of B cells and the DT40 cell line. BMC Genomics.
27. International Chicken Polymorphism Map Consortium: A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature. 2004, 432: 717-722. 10.1038/nature03156.
28. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al: The Pfam protein families database. Nucleic Acids Res. 2004, 32 (Database issue): D138-D141. 10.1093/nar/gkh121.
29. The Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32 (Database issue): D258-D261. 10.1093/nar/gkh036.
30. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, et al: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 2003, 31: 315-318. 10.1093/nar/gkg046.
31. Carninci P, Hayashizaki Y: High-efficiency full-length cDNA cloning. Methods Enzymol. 1999, 303: 19-44.
32. Buerstedde JM, Prill F: FOUNTAIN: a JAVA open-source package to assist large sequencing projects. BMC Bioinformatics. 2001, 2: 6. 10.1186/1471-2105-2-6.
33. Staden R, Judge DP, Bonfield JK: Sequence assembly and finishing methods. Methods Biochem Anal. 2001, 43: 303-322.
34. Ensembl. [ftp://ftp.ensembl.org]
35. WebLogo. [http://weblogo.berkeley.edu]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.