

  • Correspondence
  • Open Access

Hidden genes in birds

Genome Biology 2015, 16:164



We report that a subset of avian genes is characterized by very high GC content and long G/C stretches. These sequence characteristics correlate with the frequent absence of these genes from genomic databases. We provide several examples where genes in this subset are mistakenly reported as missing in birds.


  • Sequence Read Archive
  • Chicken Genome
  • Avian Gene
  • EPOR Gene
  • Frequent Absence

Main text

A recent paper reported 274 genes as missing in birds but present in the genomes of most other vertebrate lineages [1]. Here, we describe several genes from this list that are, in fact, present in the chicken genome. More importantly, we draw attention to a subset of avian genes characterized by high GC content and multiple long GC-rich stretches. We suggest that these sequence characteristics explain the frequent absence of this gene category from genomic assemblies and other sequence databases. Nevertheless, in many cases these genes can be reconstructed from the large amounts of “raw” next-generation sequencing (NGS) data available from the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI).

Pursuing our long-term interest in chicken hematopoiesis, we noticed that the gene cluster next to the erythropoietin receptor (EPOR) gene, shown in Figure 2 of Lovell et al. [1], lists LPPR2 as missing in birds. However, we already knew that both EPOR and LPPR2 exist in the chicken. The sequences of both genes show the GC-rich characteristics mentioned above. We therefore examined, though not exhaustively, the list of 274 genes reported as missing in birds [1]. Using mammalian and other vertebrate orthologs of these genes as queries, we analyzed NCBI’s SRA datasets from the chicken and other birds. In this way, we were able to reconstruct two other chicken genes, MMP14 and MRPL52. The sequences of the chicken LPPR2, MMP14 and MRPL52 genes (Additional file 1) were assembled from multiple pooled RNA-seq datasets from the SRA. Several lines of evidence indicate that these genes are, in fact, the orthologs of the corresponding genes in non-avian vertebrates. First, their sequences are absent from the current chicken assembly, or are present only as small fragments in unidentified genomic contigs. Second, phylogenetic analysis (Additional file 2) confirms that they group with their orthologs and not with their closest paralogs, MMP15 and LPPR5. Finally, for LPPR2 there is at least partial information showing correct synteny in birds. We have assembled LPPR2 of the Tibetan ground tit (Pseudopodoces humilis), which lies on the same 46-kb genomic scaffold [GenBank: NW_005087926] in P. humilis as EPOR and SWSAP1. This is in keeping with the gene arrangement in mammals.
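The probe-based search described above can be illustrated with a toy read-recruitment step. This is a minimal sketch, not the procedure used in this study: it assumes that reads are already available as plain strings, and the function names (`kmers`, `recruit_reads`), the k-mer size and the `min_shared` threshold are hypothetical choices made for illustration. Searches with orthologs from distant species would normally need a more sensitive, alignment-based comparison than exact k-mer matching.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(probe, reads, k=8, min_shared=2):
    """Keep reads that share at least min_shared k-mers with the probe.

    Both strands of the probe are indexed, so reads from either
    orientation are recruited. k and min_shared are illustrative defaults.
    """
    probe_kmers = kmers(probe, k) | kmers(reverse_complement(probe), k)
    hits = []
    for read in reads:
        shared = len(kmers(read, k) & probe_kmers)
        if shared >= min_shared:
            hits.append(read)
    return hits
```

Recruited reads would then be passed to an assembler; reads pooled from several datasets can simply be concatenated into the `reads` list before the call.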

The newly identified chicken MMP14 and MRPL52 genes also show the GC-rich sequence characteristics. To show that this sequence pattern causes persistent problems for correct gene assembly, we analyzed the 89 genes (Supplemental Table 6A in [1]) reported as missing in the chicken but present in some other bird species. Using these bird genes as probes, we were able to assemble several genes from this list (ALKBH7, BLVRB, INO80E, NDUFB7, OPLAH, PCP2, PET100, and SWSAP1) from the chicken SRA data (Additional file 1). Indeed, as shown in Fig. 1, most of the 89 genes are clear outliers in terms of their GC% and G/C-rich stretches. The majority of the 89 genes are from the P. humilis genome [2], whose assembly is, in our view, the most complete in terms of coverage of GC-rich avian genes. The distributions of GC% and G/C-stretches in P. humilis genes do not differ from those in the genes of other bird species (Additional file 3). Therefore, there is no systematic bias in the sequence composition of the majority of P. humilis genes.
Fig. 1

Patterns of GC content and G/C stretches in avian and other vertebrate genes. a Dot plot of avian genes, displaying the GC content and the average length of stretches containing G or C nucleotides. A G/C stretch was defined as an uninterrupted sequence of at least three consecutive G or C nucleotides. The complete set of approximately six thousand chicken RefSeq genes from the UCSC genome browser database [17] is depicted by blue circles. Only coding sequences longer than 299 nucleotides were analyzed. The set of 89 avian genes reported to be missing in the chicken genome [1] is depicted by open circles, and 23 avian genes newly assembled in this study are shown as red circles. Additionally, a histogram showing the distribution of average G/C-stretch length in the chicken RefSeq gene category is depicted by a blue line. b Dot plots of selected avian genes, compared with their vertebrate orthologs. GC content and average length of G/C stretches in coding sequences of chicken MMP14 and LPPR2 (reported as missing in birds [1]), and of genes from the EPO and EPOR loci are shown. If available, orthologous genes from other birds, turtles, mammals, lizards and crocodilians are included in the plots. The blue dots show the distribution of chicken RefSeq genes. Sequences of newly assembled avian genes represented in this figure, and GenBank accession numbers of sequences plotted in panel b, are listed in Additional file 1 and Additional file 4, respectively.
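The two quantities plotted in Fig. 1 can be computed directly from a coding sequence. The following is a minimal sketch based only on the definitions given in the legend (a G/C stretch is a run of at least three consecutive G or C nucleotides; coding sequences of 299 nucleotides or fewer are skipped). The function names and the choice to return `None` for short sequences are assumptions made for illustration, not the code used to produce the figure.

```python
import re


def gc_content(seq):
    """Return GC content of seq as a percentage (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_stretches(seq, min_len=3):
    """Return all uninterrupted runs of G/C nucleotides of at least min_len."""
    pattern = re.compile(r"[GC]{%d,}" % min_len)
    return pattern.findall(seq.upper())


def mean_stretch_length(seq, min_len=3):
    """Return the average length of G/C stretches (0.0 if there are none)."""
    runs = gc_stretches(seq, min_len)
    return sum(len(run) for run in runs) / len(runs) if runs else 0.0


def summarize_cds(cds, min_cds_len=300):
    """Return (GC%, mean G/C-stretch length), or None for short sequences."""
    if len(cds) < min_cds_len:
        return None
    return gc_content(cds), mean_stretch_length(cds)
```

Plotting the two values returned by `summarize_cds` against each other for every gene in a set gives a dot plot of the kind shown in panel a.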

Furthermore, we report here for the first time the sequences of the chicken erythropoietin (EPO) and EPOR genes (Additional file 1), which also share the GC-rich sequence characteristics. These genes were absent from nucleotide databases, and it was assumed that avian hematopoiesis did not require EPO signaling since primary chicken erythroid progenitors were not EPO-dependent [3–6]. The identification of the chicken EPO and EPOR genes therefore makes it possible to test whether avian EPO retains the biological activity it has in other vertebrates.

All these newly assembled avian genes, previously considered missing in all birds or in the chicken, share similar GC-rich sequence characteristics. GC-rich genes are extremely hard to amplify by PCR, a key step in NGS library preparation [7, 8]. These technical hurdles are presumably behind the absence of this gene subset from genomic databases. In particular, regions of long and concatenated GC-rich stretches cause an extreme decrease in coverage by NGS reads. Therefore, the assembly of genes in this subset requires multiple large SRA datasets (examples are provided in Additional file 1). We also note that many of the GC-rich stretches are predicted to form DNA quadruplex structures [9]. We can only speculate about the biological determinants behind the presence of the GC-rich sequence patterns. In the genes we have analyzed here, these sequence patterns appear to be conserved in birds but not in other vertebrates. The best example is EPO, where we were able to assemble orthologs in several bird species from a wide variety of avian taxa. All avian EPO sequences cluster together, while the mammalian and other non-avian EPO orthologs have lower GC content (Fig. 1b and Additional file 4). Therefore, the events leading up to this change in EPO sequence composition must have occurred in a common ancestor of birds, or there must have been some driving force maintaining this pattern throughout avian evolution. A similar evolutionary trend can be observed in POP7 (which lies next to EPO in vertebrate genomes), EPOR, its genomic neighbor SWSAP1, and other GC-rich genes reported here (e.g. MMP14 and LPPR2, as shown in Fig. 1b). For these genes, we had only a very limited amount of sequence from outside their coding regions, so their position on avian chromosomes could not be determined. An intriguing possibility is that at least some of these genes reside on avian microchromosomes. The six smallest chicken microchromosomes (chromosomes 33–38) do not have any sequence representation in the chicken genome assembly [10]. Sequence information for the larger chicken microchromosomes is also fragmentary; they have, however, been reported to have higher GC content than macrochromosomes [11, 12]. In addition, avian microchromosomes contain various types of short microsatellite repeats [13–16]. The extensive presence of these repeats is a typical feature that we observe in introns in the GC-rich gene subset.
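The quadruplex-forming potential of GC-rich stretches mentioned above can be screened for with a simple pattern search. The sketch below assumes the commonly used canonical motif (four runs of at least three G nucleotides separated by loops of 1 to 7 nucleotides); this motif, the function name `find_g4_motifs` and the loop limits are assumptions made for illustration and do not reproduce the algorithm of the QGRS-H Predictor [9]. Runs of C on the given strand are reported as candidate quadruplexes on the opposite strand.

```python
import re


def find_g4_motifs(seq, min_run=3, max_loop=7):
    """Return sorted (start, end, strand) tuples for canonical G4-like motifs.

    "+" hits are G-run motifs on the given strand; "-" hits are the
    equivalent C-run motifs, i.e. G-run motifs on the opposite strand.
    Coordinates are 0-based, end-exclusive, on the given strand.
    """
    seq = seq.upper()
    hits = []
    for base, strand in (("G", "+"), ("C", "-")):
        # One run of the base, then three (loop + run) repeats.
        pattern = re.compile(
            r"%s{%d,}(?:[ACGT]{1,%d}%s{%d,}){3}"
            % (base, min_run, max_loop, base, min_run)
        )
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)
```

Because `finditer` reports non-overlapping matches only, this sketch gives a lower bound on the number of candidate motifs in a sequence.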


We report the existence of avian genes with strongly biased GC patterns. These genes have been underrepresented in genomic databases, probably due to technical obstacles in genomic library preparation. In addition to identifying the chicken EPO and EPOR loci, we analyzed the gene set reported as missing in birds [1] and found additional examples of such genes. Our examination of the genes listed in Lovell et al. [1] was not exhaustive, so several more of the avian genes absent from current databases can be expected to be assembled from SRA data. Nevertheless, the vast majority of the genes reported in Lovell et al. [1] are probably genuinely missing in birds, and their article includes a detailed discussion of the evolutionary aspects of this phenomenon. The existence of an underrepresented GC-rich gene subset was originally suggested in the 2004 report on the chicken genome sequence [12]. Here, we present detailed examples of such genes, which pose an analytical challenge from both technical and evolutionary perspectives.



Abbreviations

NGS: Next-generation sequencing
NCBI: National Center for Biotechnology Information
SRA: Sequence Read Archive
EPOR: Erythropoietin receptor





This work was supported by program NÁVRAT (LK11215) and NPU I (LO1419) provided by the Czech Ministry of Education, Youth and Sports. Access to computing and storage facilities provided by ELIXIR CZ and the National Grid Infrastructure MetaCentrum, administered under the programme Projects of Large Infrastructure for Research, Development, and Innovations (LM2010005), is greatly appreciated.

Authors’ Affiliations

Institute of Molecular Genetics, Academy of Sciences of the Czech Republic, Vídeňská 1083, 14220 Prague, Czech Republic


  1. Lovell PV, Wirthlin M, Wilhelm L, Minx P, Lazar NH, Carbone L, et al. Conserved syntenic clusters of protein coding genes are missing in birds. Genome Biol. 2014;15:565.
  2. Cai Q, Qian X, Lang Y, Luo Y, Xu J, Pan S, et al. Genome sequence of ground tit Pseudopodoces humilis and its adaptation to high altitude. Genome Biol. 2013;14:R29.
  3. Beug H, Steinlein P, Bartunek P, Hayman MJ. Avian hematopoietic cell culture: in vitro model systems to study oncogenic transformation of hematopoietic cells. Methods Enzymol. 1995;254:41–76.
  4. Dolznig H, Bartunek P, Nasmyth K, Mullner EW, Beug H. Terminal differentiation of normal chicken erythroid progenitors: shortening of G1 correlates with loss of D-cyclin/cdk4 expression and altered cell size control. Cell Growth Differ. 1995;6:1341–52.
  5. Hayman MJ, Meyer S, Martin F, Steinlein P, Beug H. Self-renewal and differentiation of normal avian erythroid progenitor cells: regulatory roles of the TGF alpha/c-ErbB and SCF/c-kit receptors. Cell. 1993;74:157–69.
  6. Schroeder C, Gibson L, Nordstrom C, Beug H. The estrogen receptor cooperates with the TGF alpha receptor (c-erbB) in regulation of chicken erythroid progenitor self-renewal. EMBO J. 1993;12:951–60.
  7. Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12:R18.
  8. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14:R51.
  9. Menendez C, Frees S, Bagga PS. QGRS-H Predictor: a web server for predicting homologous quadruplex forming G-rich sequence motifs in nucleotide sequences. Nucleic Acids Res. 2012;40:W96–W103.
  10. Griffin D, Burt DW. All chromosomes great and small: 10 years on. Chromosome Res. 2014;22:1–6.
  11. Costantini M, Di Filippo M, Auletta F, Bernardi G. Isochore pattern and gene distribution in the chicken genome. Gene. 2007;400:9–15.
  12. International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716.
  13. Deryusheva S, Krasikova A, Kulikova T, Gaginskaya E. Tandem 41-bp repeats in chicken and Japanese quail genomes: FISH mapping and transcription analysis on lampbrush chromosomes. Chromosoma. 2007;116:519–30.
  14. Ishishita S, Tsuruta Y, Uno Y, Nakamura A, Nishida C, Griffin DK, et al. Chromosome size-correlated and chromosome size-uncorrelated homogenization of centromeric repetitive sequences in New World quails. Chromosome Res. 2014;22:15–34.
  15. Krasikova A, Fukagawa T, Zlotina A. High-resolution mapping and transcriptional activity analysis of chicken centromere sequences on giant lampbrush chromosomes. Chromosome Res. 2012;20:995–1008.
  16. Shang WH, Hori T, Toyoda A, Kato J, Popendorf K, Sakakibara Y, et al. Chickens possess centromeres with both extended tandem repeats and short non-tandem-repetitive sequences. Genome Res. 2010;20:1219–28.
  17. Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015;43:D670–681.


© Hron et al. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.