Identification of novel exons and transcribed regions by chimpanzee transcriptome sequencing
© Wetterbom et al.; licensee BioMed Central Ltd. 2010
Received: 15 December 2009
Accepted: 23 July 2010
Published: 23 July 2010
We profile the chimpanzee transcriptome by using deep sequencing of cDNA from brain and liver, aiming to quantify expression of known genes and to identify novel transcribed regions.
Using stringent criteria for transcription, we identify 12,843 expressed genes, with a majority being found in both tissues. We further identify 9,826 novel transcribed regions that are not overlapping with annotated exons, mRNAs or ESTs. Over 80% of the novel transcribed regions map within or in the vicinity of known genes, and by combining sequencing data with de novo splice predictions we predict several of the novel transcribed regions to be new exons or 3' UTRs. For approximately 350 novel transcribed regions, the corresponding DNA sequence is absent in the human reference genome. The presence of novel transcribed regions in five genes and in one intergenic region is further validated with RT-PCR. Finally, we describe and experimentally validate a putative novel multi-exon gene that belongs to the ATP-cassette transporter gene family. This gene does not appear to be functional in human since one exon is absent from the human genome. In addition to novel exons and UTRs, novel transcribed regions may also stem from different types of noncoding transcripts. We note that expressed repeats and introns from unspliced mRNAs are especially common in our data.
Our results extend the chimpanzee gene catalogue with a large number of novel exons and 3' UTRs and thus support the view that mammalian gene annotations are not yet complete.
It is generally believed that comparisons at the genome and transcriptome levels are powerful strategies towards understanding the molecular differences that underlie the phenotypic divergence between humans and chimpanzees. Since the time of divergence, approximately 6 million years ago , the two species have acquired changes in both their genomes and their transcriptomes. The draft chimpanzee genome sequence  provided new opportunities to study primate biology and to understand the speciation process. Mikkelsen et al.  presented a comprehensive analysis of the chimpanzee genome and a comparative analysis with the human genome, but did not address the complexity of the transcriptome. Studies of chimpanzee transcription have been performed primarily with microarrays, covering both coding [3–8] and noncoding regions  of the genome. Due to the sequence divergence between the species, the use of microarrays can be problematic when arrays based on human sequence data are employed to study the chimpanzee transcriptome. To circumvent this problem, custom-made arrays with species-specific probes have been used [10, 11]. Notwithstanding, all targeted expression arrays are based on a priori assumptions regarding the expressed parts of the genome, and are therefore not suitable for unbiased studies of transcription and discovery of novel transcripts.
Direct detection of both known and novel transcripts and exons can be achieved by complete sequencing of the cDNA population. This strategy was used by Sakate et al. , who used Sanger sequencing of cloned cDNA libraries to assemble the 5' end of 226 protein-coding chimpanzee genes and later to describe the full-length cDNAs from 87 protein-coding genes . However, Sanger sequencing is not suitable for capturing the full complement of the chimpanzee transcriptome. The advent of second-generation sequencing techniques now makes it possible to directly sequence the RNA populations (RNA-Seq) to an unprecedented depth, providing information on both which genomic regions are transcribed and their expression levels. This technology has been successfully applied to eukaryotes, including yeast [14, 15], Caenorhabditis elegans , mouse  and human [18–20]. More recently, transcriptome profiling has been employed for comparative studies of human, chimpanzee and rhesus macaque [21, 22]. Blekhman et al.  used RNA-Seq to identify a large number of genes in liver, where the expression levels appear to be under natural selection in primates. They also identified a group of genes with similar expression levels between species but with a sexually dimorphic expression pattern. In another study, Babbitt et al.  used Tag-sequencing [23, 24] to survey the coding and noncoding transcriptome in frontal cortex. Their results show that in addition to protein-coding genes, a group of noncoding transcripts is also conserved between the species.
Transcriptome studies usually rely on different types of annotations to determine where genes and other functional elements are located within the genome. Such gene predictions can be based either on the DNA sequence itself or on alignments of mRNAs and/or ESTs . For chimpanzee, few expression data are available and gene models have therefore been based on evidence from human annotations. This results in a homocentric view that has previously made it hard to detect expression of chimpanzee-specific transcripts. However, using RNA-Seq it is now possible to measure gene expression and capture the diversity of the chimpanzee transcriptome in an unbiased way. Transcriptome sequencing of human HapMap cell lines  indicates that even for well-annotated genomes, it is possible to identify many novel transcripts. Pickrell et al.  describe almost a thousand novel transcribed regions (TRs) that appear to be part of existing human gene models. Many of these regions are spliced to annotated exons and most regions were putative new UTRs rather than protein-coding exons. Based on these findings we expected also to find a large number of novel TRs in the chimpanzee genome.
Here we report the results from sequencing of the chimpanzee transcriptome in brain (frontal cortex) and liver at higher coverage than previous studies [21, 22]. Our samples are unique, originating from infant chimpanzees, and thus the results provide an important complement to studies of adult animals. Using the SOLiD platform, we generated over 500 million reads from the brain and liver of two chimpanzees. The sequence data were used to quantify expression of known genes and to identify novel TRs, some of which appeared to be absent from the human genome. In these analyses we assessed differences both between the tissue types and between individuals. Furthermore, by combining the novel TRs with de novo splice predictions we were able to detect numerous uncharacterized exons and 3' UTRs that extended existing gene models. Using this strategy we also identified a putative novel member of the ATP-cassette transporter gene family, located on chromosome 16. These novel exons and the new gene model were not included in human transcript databases, thereby suggesting that they may account for some of the uncharacterized variation between the species. In conclusion, our experimental approach enabled us to create a comprehensive catalogue of transcribed elements across tissues and individuals.
Sample preparation, sequencing and mapping of reads
Mapping summary for all six SOLiD runs
Total number of reads
Uniquely aligned reads
Based on the mapped reads we constructed a coverage signal profile across the chimpanzee genome. To minimize the effect of pileups of identical reads at some genomic positions, which may lead to overestimated gene expression levels, we computed the coverage signal exclusively from reads with unique starting points. This is a conservative approach and may lead to reduced dynamic range for studying gene expression. A common problem in RNA-Seq analyses is an uneven representation of the 3' and 5' ends of the mRNA [17, 27]. To evaluate this aspect of the data, as well as our ability to detect known coding regions, we used the coverage signal of all chimpanzee RefSeq genes  and computed the average coverage for all exons and introns with different rank (for example, last exon, second to last exon and so on; Figure S1 in Additional file 1). Although the coverage signals agreed very well with the location of RefSeq exons, we observed a bias towards more reads in the 3' end of the genes. This bias was most likely due to incomplete reverse transcription of the template RNA from the oligo(dT) priming used in the first-strand cDNA synthesis.
Quantifying gene expression and detecting transcribed regions
Number of expressed RefSeq genes and transcribed regions
Number of expressed RefSeq genes
Total number of TRs
Number of novel TRs
The same expression level thresholds were then used for de novo detection of TRs across the genome, not limiting the analysis to predefined gene annotations. To reduce the frequency of false positives, we required each region to have at least 50 bp consecutively above the cut-off. Additionally, we required the regions to be expressed in the same tissue of both animals. A total of 116,075 TRs were detected in brain and 61,920 in liver (Table 2), including both regions overlapping with RefSeq genes and TRs outside of previous annotations. The coordinates for all TRs are available in Additional file 2 and can be uploaded and viewed in the UCSC Genome Browser [29, 30].
Localization and expression of transcribed regions
The gene expression levels (measured in dcpm) for different genomic locations are shown in Figure 4b. The exonic regions had the highest dcpm and there was only a small difference in expression levels between terminal and all other exons, although more TRs were found in terminal exons counted in absolute numbers. The second highest dcpm was seen for regions within 10 kb downstream of RefSeq genes, followed by the categories 10 kb upstream, intergenic and finally intronic. Novel TRs had lower average dcpm levels than annotated TRs in the same genomic locations, thus emphasizing the potential of deep sequencing to identify TRs that have previously escaped detection due to lower transcription levels.
Comparing expression levels in frontal cortex and liver
In addition to expression of known genes, we also identified a large number of novel TRs. The vast majority (84%) of novel TRs mapped within RefSeq introns or within 10 kb upstream or downstream of RefSeq genes. As a comparison, these extended RefSeq loci covered approximately half of the genome, and thus there was a clear enrichment of TRs in these regions. The tissue distribution of genes containing novel TRs is displayed in Figure 5b. In comparison to the results for all expressed genes (Figure 5a), a larger proportion of genes with novel TRs was found in brain than in liver. Furthermore, we noted that the overlap between tissues was lower for genes containing novel TRs than for expressed genes in general, indicating that novel TRs have a higher degree of tissue specificity. A Gene Ontology analysis showed that genes harboring novel TRs belonged to similar biological processes as was found for expressed genes in general (Table S3 in Additional file 1).
Comparing expression levels between individuals
Although most genes were detected in both chimpanzees for each respective tissue, we observed some differences between the individuals. A total of 14,301 genes were detected in the frontal cortex samples and 80% (n = 11,319) of these were found in both chimpanzees. The corresponding figures for liver are 12,653 genes in total, with 70% (n = 8,810) shared between both individuals. Levels of gene expression were highly correlated between individuals, in both brain and liver (Figure 2c,d). We also noted that genes with individual-specific expression generally had lower expression levels than genes detected in both chimpanzees, indicating that the differences between samples is, to some extent, a result of the sequencing depth. In contrast to known genes, where a large proportion was shared between individuals, novel TRs were not shared to the same extent. Of the novel TRs, only 14% and 11% were common to both individuals in brain and liver, respectively. This reflects the fact that novel TRs were generally expressed at lower levels, and in parallel to known genes, regions with lower expression level were shared to a lesser extent between the two chimpanzees.
Further characterization and validation of novel transcribed regions
We required TRs to be present in both chimpanzees and using this criterion we identified 116,075 TRs in brain and 61,920 in liver. TRs were often found in clusters, thus indicating that closely spaced TRs originated from the same transcript and with increasing sequencing depth many of the TRs are likely to merge into longer transcripts. Seventeen percent (19,446) of the TRs in brain and 10% (6,496) of those in liver did not overlap any previously annotated exons, mRNAs or ESTs. Such TRs were considered novel TRs and analyzed further to elucidate their origin and function. Current gene annotations in chimpanzee are almost exclusively based on transcriptional data originating from human and this implies that the novel TRs in our study have not been previously detected in human. The explanation for this may be either that the novel TRs are absent in human tissues or that they are expressed at low levels and have thus escaped detection.
Novel transcribed regions absent from the human genome
A subgroup of novel TRs could not be mapped to the human genome sequence, indicating that the DNA-sequence has been lost in human or gained in chimpanzee. Starting with the complete dataset of all novel TRs (19,446 in brain and 6,496 in liver), we selected all regions where the coordinates could not to be translated to the human genome (hg19). For these candidates, we BLASTed  the transcribed chimpanzee sequences to the entire human genome and selected TRs that did not give a significant match. This resulted in 285 novel TRs in brain and 77 in liver, which were not present in the human genome sequence. Such novel TRs were located both in the vicinity of known genes (that is, intronic or within 10 kb upstream or downstream) and in intergenic regions. The genomic regions that were absent in the human genome have either been lost in the human lineage or gained in the chimpanzee lineage and to deduce the evolutionary history we used the macaque genome as an outgroup. Within this subset of novel TRs approximately half (n = 133 in brain and n = 39 in liver) could not be located in either the human or the macaque genomes, thus indicating sequence gain in the chimpanzee lineage. For the remaining part of novel TRs the region was found both in the chimpanzee and macaque genomes and such regions have most likely been lost in the human lineage.
Novel exons and UTRs
We aimed specifically at identifying TRs that represented novel exons and UTRs. It is expected that many such TRs will extend known gene annotations and thus the TRs will be found close to known genes. This was supported by the finding that 84% of the novel TRs mapped in the vicinity of a RefSeq gene (that is, intronic or within 10 kb upstream or downstream; see Figure S2a,b in Additional file 1 for genomic distribution of novel TRs). This is an appreciable enrichment since the extended RefSeq regions covered only 52% of the genome. A proportion of these novel TRs probably represent novel exons or UTRs. Considering the sequencing bias, we expected to find mainly 3' UTRs and only to a lesser extent 5' UTRs.
Initially, we examined the length of novel TRs and found it to vary between 50 and 2,500 bp, with a median of 142 bp for brain and 151 bp for liver. The length distribution was plotted together with the lengths of RefSeq exons (Figure S3 in Additional file 1). The peak around the TR mean coincided with a similar peak in the length distribution for RefSeq exons, and the extended tail of longer novel TRs resembled the distribution of terminal exons (including annotated 3' UTRs). This comparison suggested that our collection of novel TRs was composed of a mixture of exons and 3' UTRs.
To further explore the genomic arrangement of novel TRs and to what extent they were connected to known gene annotations, we used the SplitSeek program  to perform a de novo search for splice junctions. Since the SplitSeek algorithm is not applicable for the shorter read lengths, we used only the 50-bp reads from the brainF and liverF samples in this analysis (see Materials and methods for details). A total of 1,904 splice junctions were predicted in the brainF sample and 4,279 in liverF, with as many as 90% of them mapping exactly to an exon-exon boundary in a known gene. The remaining 10% are the most interesting, since they include previously undetected splicing events connecting novel TRs to each other and to known gene models.
Other types of novel transcripts
Percentage of novel transcribed regions that overlap with different repeat classes
All other repeat classes
Novel TRs in brain
Novel TRs in liver
Experimental validation of novel transcribed regions
We used reverse transcriptase PCR (RT-PCR) to verify the existence of novel TRs in seven example regions, including five genes, one intergenic region and a putative novel protein-coding gene. For novel TRs adjacent to genes, we also designed primers to amplify a fragment from the previously annotated gene, to act as a positive control and as a baseline for comparing the expression levels of novel TRs. The examples chosen for validation were selected to represent both intergenic and genic TRs. The expression levels of the validated TRs, as determined from the sequencing data, varied from just above the threshold to highly expressed regions. We were able to validate all the five novel TRs within genes, as well as the intergenic TR and for the putative novel gene we were able to validate the majority of predicted exons and junctions. The high success rate indicated that the threshold for expression was very conservative and that there exists additional RNAs that are not captured by our analyses.
In general, we observed that novel TRs had lower expression levels than the positive controls from the same gene (as judged from the intensity of the fragment on the gel). This agrees well with the results presented in Figure 4b where novel TRs have lower expression levels than annotated TRs in the same region. Furthermore, we noted that in NDUFA7 (Figure S6 in Additional file 1), PRDM5 (Figure S7 in Additional file 1) and UROS (Figure 8 in Additional file 1) the expression level of the control was similar in brain and liver, whereas the expression levels of the novel TRs differed between the two tissues. These results support the notion that novel TRs are tissue specific to a larger extent than transcripts in general. Some novel TRs were only detected in a single tissue, based on the threshold used to define transcription, but the region was amplified with RT-PCR in both tissues. One such example was the KNG1 gene, where the novel TR was initially only predicted in liver but we were able to amplify the region also in brain (although with a significantly weaker band on the gel). This further suggests that not all low abundance TRs are captured with our threshold for transcription.
Characterization of a novel chimpanzee gene
Searching the database of RefSeq proteins suggested that the predicted protein belonged to the ATP-binding cassette gene family. The highest similarity score was found for the mouse protein Abca15 (NP_796187.2), with 90% our predicted gene covered in the alignment and 73% identity among the aligned amino acids (see Supplementary material in Additional file 1). In addition to the eight predicted exons there was also another TR just downstream of the fourth exon. This region was expressed at significantly lower levels and inclusion of this TR disrupted the open reading frame and thus it has not been added to the proposed gene model. We used RT-PCR and validated six out of seven possible pairs of consecutive exon-exon junctions; three exon-exon junctions were also supported by predicted splice junction.
The novel gene model was further supported by EST evidence from GenBank [38, 39]. We found macaque ESTs covering exons 2 to 6 in our model, and human ESTs covering exons 3, 4 and 6. We noted that the fifth exon in our gene model was skipped in the human ESTs and, as seen in the human alignment track in Figure 9, a sequence of 312 bp encompassing exon 5 has been rearranged to a different location on the same chromosome (located more than 500 kb upstream of the other exons). Removal of exon 5 in human causes a shift in the reading frame and introduces premature termination codons in the predicted protein sequence. Since the novel predicted gene appeared to be transcribed in chimpanzee and other mammals, but not in full in human, we speculate that it has become a pseudogene in human.
We have sequenced the chimpanzee transcriptome in frontal cortex and liver, and with more than 200 million uniquely mapping reads this is the most comprehensive dataset to date, providing a rich source for validation of existing RefSeq genes and for identification of novel TRs. Our aim was to provide a comprehensive catalogue of transcripts from protein-coding loci and to achieve this we enriched for putative coding RNAs by using oligo(dT) primers for reverse transcription. To establish a threshold for expression we compared the expression values of RefSeq genes with random intergenic regions and the threshold was set to exclude 95% of the random sequences. This approach rests on the premise that the majority of the genome is only transcribed at a low level and although this issue is still not completely resolved, a recent RNA-Seq study suggests that most of the genome is not appreciably transcribed . Using this strategy we detected 11,319 expressed genes in brain and 8,810 in liver. The numbers were slightly lower than other RNA-Seq studies [21, 22] of chimpanzee, indicating that our threshold was conservative. This was further suggested by the RT-PCR validation where we were able to amplify TRs with expression below the threshold. However, with the choice between finding all novel transcripts and minimizing the number of false positives, we prioritized the latter.
Having shown that we could reliably detect expression of known genes, we used the same expression threshold for de novo detection of TRs across the chimpanzee genome. A large number of TRs were found (over 150,000 in the two tissues) and given that many were located close together we assume that with increasing sequencing depth many TRs would merge into longer consecutive transcripts. However, to avoid introducing an arbitrary distance cut-off we did not merge nearby TRs in the analyses. TRs that were not overlapping with annotations of human or chimpanzee RefSeq exons, mRNAs or ESTs were termed 'novel', implying that the region has not previously been annotated as transcribed. Based on the oligo(dT)-priming employed for cDNA synthesis, we expected that most novel TRs would originate from polyadenylated transcripts and this was supported by the fact that 84% of the novel TRs mapped within the boundaries of RefSeq genes (that is, intronically or within ± 10 kb), thus suggesting a multitude of uncharacterized exons and UTRs. It should be noted, however, that not all exons and UTRs are necessarily protein coding. This was shown, for example, by the manual annotation of the ENCODE regions , which found that an appreciable proportion of transcripts from genic loci appear to be noncoding.
Novel TRs may also stem from several other types of noncoding transcripts, including expressed repeats, retained introns in unspliced mRNAs, different types of small ncRNAs or antisense transcripts. Expressed repeats and unspliced mRNAs were quite commonly observed in our data and other types of ncRNAs were not expected to be present in large numbers, considering the oligo(dT)-priming of the cDNA synthesis. Widespread transcription outside protein-coding regions in chimpanzees has been observed by Khaitovich et al.  and Babbitt et al.  and much of it appears to be conserved in primates, thus suggesting its functional importance. To address these issues in the present samples, sequencing of total RNA, with protocols providing strand-specificity, will ideally be employed.
For a small subset of approximately 350 novel TRs, the corresponding genomic sequence appeared to be absent in human. Further comparison with the macaque genome revealed that for approximately half of the novel TRs, the corresponding genomic regions had been lost in human whereas for the other half the region had been inserted into the chimpanzee genome. We have experimentally verified one such transcript where the corresponding genomic region was absent in the human genome. Since the region was also present in the macaque genome, the most plausible explanation is that the sequence was lost on the human lineage. Structural variation leading to species-specific genomic regions in both human and chimpanzee has been described by several groups, ranging from short insertions and deletions  to large segmental duplications . If such genomic regions harbor transcriptionally active loci, they will contribute to the transcriptional diversity between the species. Thus, the group of TRs described here is intriguing and future comparative studies will have to determine whether they provide important clues to the divergent phenotypes between humans and chimpanzees.
Alternative splicing is an important mechanism for generating transcript divergence, and to assess splicing patterns in our data we used the SplitSeek software for de novo predictions of splice junctions. The advantage with this method is that we were able to predict exon-exon junctions that have not been described previously, in contrast to many other RNA-Seq studies that rely on methods based on predefined junction libraries. We predicted approximately 200 novel splice junctions in brain and approximately 400 in liver and this is likely to represent only a fraction of the splicing divergence. Due to the nature of our transcript data many of these novel splice junctions were found in 3' regions of genes. The difference in the number of splice junctions between brain and liver most likely reflects the higher sequence coverage in the liver sample rather than a biological difference. The expression of different transcriptional isoforms is known to vary between human tissues  and our analyses indicated that this was also true in chimpanzee. Tissue-specificity of noncoding transcripts has also been described previously in human  and our results confirmed that the same applied in the chimpanzee. Genes with novel TRs were expressed in a more tissue-specific manner than genes with no novel TRs. Furthermore, we noted several examples where the previously annotated parts of a gene were expressed at the same level in brain and liver, whereas the expression levels of novel TRs differed between the tissues. Bearing in mind that the threshold for expression was conservative, it is quite plausible that we did not capture all low abundance transcripts and thus the proportion of tissue-specific TRs may be even higher.
The samples used for sequencing originate from two infant chimpanzees and this makes the dataset unique and an important complement to previous datasets from frontal cortex  and liver  of adult individuals. Somel et al.  have previously shown that the brain transcriptome in primates is remodeled after birth and that gene expression differs with age. Although there is no parallel study of developmental changes in the liver transcriptome, it is reasonable to assume that considerable postnatal transcriptional changes occur also in the liver. The age difference is also one plausible explanation of why we were unable to replicate the sexually dimorphic gene expression observed by Blekhman et al. .
Gene annotations in chimpanzee have relied almost entirely on human mRNAs and ESTs as supporting evidence. However, not all genes are identical between the species and by using current human-centered gene annotations it is only possible to find genes where regions are present in human and absent in chimpanzee, and not the opposite. Our analyses revealed a large number of novel TRs located in the proximity of RefSeq genes and thereby extends existing annotations of as many as 9,826 genes (Figure 5b). We have a bias with higher sequencing depth towards the 3' end of transcripts, and this gives us the possibility to characterize novel 3' UTRs, even in transcripts with very low expression levels. 3' UTRs have a role in post-transcriptional regulation of gene expression, stability of the transcript and as possible targets for microRNAs. An illustrative example is the UROS gene (Figure 8), where we identified and validated an alternative 3' UTR that was longer than the RefSeq annotated UTR. We also identified and experimentally validated examples of putative novel exons from five genes (Figures S4, S5, S6 and S7 in Additional file 1). Such exons could possibly add novel coding regions and thereby alter the amino acid composition of the resulting protein. Similar results, with a large number of uncharacterized exons and UTRs, have been reported in human  and our results support these findings. Finally, we describe an example of a putative novel protein-coding gene that belongs to the ATP-binding cassette transporter gene family. The gene appeared to be present in chimpanzee and several other primates but part of the DNA sequence was absent from the human genome, thus suggesting that the gene is not functional in human. The novel exons, 3' UTRs and the putative gene described here are not present in the current annotations of chimpanzee genes and thus clearly show the weakness of existing homocentric gene catalogues. Taken together, our results point to a great transcriptional diversity in chimpanzee that has not been previously characterized.
We have sequenced the chimpanzee transcriptome in frontal cortex and liver and provided a comprehensive catalogue of expressed RefSeq genes and numerous novel TRs. The vast majority of novel TRs was found within or in the vicinity of known genes and thus extends existing gene models, mainly by adding new exons and 3' UTRs. Furthermore, we have provided evidence of a gene that appears to have been lost in the human lineage. Our analyses highlight the great potential of combining RNA-Seq with splice junction predictions in order to generate a more complete understanding of transcriptome diversity.
Materials and methods
Tissue samples from liver and frontal cortex were obtained through autopsies of two young chimpanzees (one female and one male), from the Kolmården Zoo, Sweden. The deep frozen tissues were cryo-sectioned and the slices recovered in QIAzol lysis buffer (Qiagen, Valencia, CA, USA). Ten 10-μm slices were used for the liver samples and 15 slices for frontal cortex. The miRNeasy (Qiagen) protocol for small tissue samples was used to prepare total RNA, which was subsequently treated with DNAse (Qiagen) to remove possible contamination from genomic DNA and further purified using the mini-elute kit (Qiagen). First strand cDNA synthesis was performed with the SuperSMART PCR cDNA synthesis kit (Clontech, Mountain View, CA, USA), using the protocol supplied by the manufacturer. The method enriches for full-length cDNAs by using specific oligomers for priming. A poly(A)-specific primer initiates the first strand synthesis of cDNA, thereby selecting for polyadenylated RNA while simultaneously keeping the concentration of ribosomal RNA low. The resulting single-stranded cDNA was amplified with the Advantage2 PCR kit (Clontech) using 27 amplification cycles. The amplified cDNA was cleaved with Rsa1 (NewEngland BioLabs, Ipswich, MA, USA) to partially remove the added oligonucleotide and 3' primer. The short cleavage products were eliminated in the size selection step in the library preparation preceding sequencing.
Sequencing and mapping of reads
The cDNA was used to prepare a fragment library from each sample, according to the protocol supplied by the manufacturer (Applied Biosystems, Carlsbad, CA, USA) and sequenced using the SOLiD platform (Applied Biosystems). First, all four libraries were sequenced with a read length of 35 bp, brainF and brainM on a whole slide each, and liverF and liverM on half a slide each. Second, the libraries for brainF and liverF were sequenced with a read length of 50 bp, using a quarter of a slide per library. Sequencing reads have been deposited in the European Nucleotide Archive at EMBL with the accession number [EMBL: ERA000160].
Reads were aligned using the 'anchor-extend' method in the whole transcriptome analysis tool from Applied Biosystems . In the alignment each read was split into two parts, or 'anchors' of length 25, which were then mapped to the chimpanzee March 2006 (panTro2) reference sequence. Each anchor that mapped uniquely to the reference was then extended as long as it was still matching the reference. This 'anchor-extend' approach makes it possible to map reads that partly overlap with a splice junction.
When computing the coverage signal, all reads that mapped to the exact same position and on the same strand were merged and only counted once. In this way we reduced the experimental bias caused by uneven PCR amplification of transcripts. The brainF and liverF libraries were sequenced twice with different read length (35 bp and 50 bp). Since there was a very high correlation between the replicates with 35 and 50 bp (Figure 2a,b), they were merged into one dataset per tissue.
Quantification of expression levels of RefSeq genes
For each of the samples brainM, brainF, liverM and liverF, the coverage (that is, number of overlapping reads) was calculated for every position of the reference genome. To annotate genes, we used the 'Chimpanzee and human RefSeq genes' track in the UCSC Genome Browser [29, 30]. This track defines genes based on the alignment of RefSeq  mRNAs. Thus, different isoforms from the same genomic locus will be annotated as different genes although they have the same common gene name (HGNC symbols).
To quantify expression levels, we calculated the average dcpm value  in the last 500 bp of the last exon. The average dcpm value was based on the whole last exon if it was shorter than 500 bp. In this way, each RefSeq gene was assigned a unique expression value. In order to establish a cut-off for expression of genes, we used the following strategy. We compared the expression levels, calculated as above, with the average dcpm values of randomly sampled regions outside of RefSeq exons. For each RefSeq gene, we sampled one random non-exonic region of the same length as was used for calculating the gene expression, that is, at most 500 bp. The sampling was done from the same chromosome as the gene was localized on and in this way we could compute average dcpm values for the same number of randomly sampled regions as the number of RefSeq genes. Assuming that most of the genome is transcribed only at very low levels , we then applied a cut-off so that 95% of the genes/TRs above the threshold were detected as transcribed and only 5% of the random regions. This is the same as having a 5% false discovery rate. However, we expect that some of the randomly sampled regions may represent biological transcripts, rather than randomly mapped reads, and thus the true false discovery rate is lower than 5%. Since the expression levels differed between the samples, the procedure was repeated once for each dataset, resulting in different cut-offs. After having established a cut-off for each of the samples, we used this to scan for novel TRs. All regions with a dcpm signal above the threshold over a stretch of at least 50 consecutive bases were considered to be transcribed, but to be included in the dataset of TRs we further required detection in the same tissue in both individuals.
De novodiscovery and annotation of transcribed regions
The expression threshold described above was subsequently used to define TRs across the entire genome. The complete set of TRs was compared to the RefSeq, mRNA and EST tables from the UCSC Genome Browser [29, 30], containing information about genes, mRNAs or ESTs detected either in chimpanzee or in human. We then selected the TRs that did not overlap with any of the RefSeq genes, mRNAs or ESTs and such TRs were considered to be 'novel'. To increase the confidence in these novel TRs, we required them to be > 50 bp in length and to be present (that is, overlap at least one base pair) in the same tissue of both individuals.
Detection of novel regions with no support in the human genome
The coordinates for all novel TRs were translated from the chimpanzee genome (panTro2) to the human genome (hg19) using the liftOver application available in the UCSC Genome Browser [29, 30]. Regions that did not translate were BLASTed  to the entire human genome (including chromosomes labeled as random and unknown) and we then filtered out the ones that did not give any match (with e-value < 0.001). For the evolutionary analysis we used the liftOver tool to further translate the novel TRs to the macaque genome (rheMac2).
Gene Ontology analysis
The DAVID functional annotation tool  was used to perform Gene Ontology classification of various lists of RefSeq genes. We generally used all genes in the genome as a background, focused on the ontology of biological processes and considered a corrected (Benjamini) P-value < 0.01 to be significant.
Splice junction detection
To detect splice junctions, we analyzed the reads using the version 1.3.2 of the SplitSeek algorithm , available from the SOLiD software development community . Since the SplitSeek algorithm is not applicable for the shorter read lengths, we used only the 50-bp reads from the brainF and liverF samples in this analysis. The analysis procedure and parameters were the same as in our previous study on mouse RNA-Seq data . We required each prediction to be supported by at least two uniquely mapping reads, and the two parts of the junction to be separated by at most 1 Mb. All SplitSeek predictions are included in Additional file 2, which can be uploaded and viewed in the UCSC Genome Browser together with all identified TRs in the chimpanzee samples.
Experimental validation of novel transcribed regions
RNA samples extracted from the female brain and liver were used to regenerate cDNAs for RT-PCR, including reactions without RT enzyme. We used the SMARTer™ Pico PCR cDNA synthesis Kit from Clontech (which has replaced the SuperSMART PCR cDNA kit used for library preparation) following the manufacturers instructions. Briefly, 1 μg of DNAase treated total RNA was included in each cDNA synthesis reaction and incubated at 42°C for 90 minutes. The cDNAs were purified using the Nucleospin Extract II kit and recovered in a final volume of 160 μl. Then, 1 μl of each cDNA reaction was used in 25 μl PCR reactions using the Advantage 2 PCR kit. The primers used in amplifications are given in Table S4 in Additional file 1. The amplifications were initiated by 1 minute incubation at 95°C followed by 35 cycles at 95°C for 15 s, 57 to 60°C for 30 s and 68°C for 30 s. The 10 × Advantage PCR buffer SA was used for primer pairs associated with MN1. We separated 10 μl of the PCR reactions using 2 to 3% NuSieve agarose gels.
depth of coverage per million reads
expressed sequence tag
long terminal repeat
reverse transcriptase PCR
We thank Bengt Röken at the Kolmården Zoo (Sweden) for sharing chimpanzee samples, Tomas Bergström for sample collection and the pathologists at our department for help with tissue handling. We also acknowledge the staff at Uppsala Genome Center who performed the SOLiD sequencing. Financial support for this project has been obtained from the Swedish Natural Sciences Research Council (UG), The Knut and Alice Wallenberg Foundation (UG, LC) and The National Graduate School for Functional Genomics and Bioinformatics (AW).
- Chen FC, Li WH: Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet. 2001, 68: 444-456. 10.1086/318206.PubMedPubMed CentralView ArticleGoogle Scholar
- Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang S, Enard W, Hellmann I, Lindblad-Toh K, Altheide TK: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437: 69-87. 10.1038/nature04072.View ArticleGoogle Scholar
- Khaitovich P, Hellmann I, Enard W, Nowick K, Leinweber M, Franz H, Weiss G, Lachmann M, Paabo S: Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Science. 2005, 309: 1850-1854. 10.1126/science.1108296.PubMedView ArticleGoogle Scholar
- Enard W, Khaitovich P, Klose J, Zollner S, Heissig F, Giavalisco P, Nieselt-Struwe K, Muchmore E, Varki A, Ravid R, Doxiadis GM, Bontrop RE, Paabo S: Intra-and interspecific variation in primate gene expression patterns. Science. 2002, 296: 340-343. 10.1126/science.1068996.PubMedView ArticleGoogle Scholar
- Caceres M, Lachuer J, Zapala MA, Redmond JC, Kudo L, Geschwind DH, Lockhart DJ, Preuss TM, Barlow C: Elevated gene expression levels distinguish human from non-human primate brains. Proc Natl Acad Sci USA. 2003, 100: 13030-13035. 10.1073/pnas.2135499100.PubMedPubMed CentralView ArticleGoogle Scholar
- Uddin M, Wildman DE, Liu G, Xu W, Johnson RM, Hof PR, Kapatos G, Grossman LI, Goodman M: Sister grouping of chimpanzees and humans as revealed by genome-wide phylogenetic analysis of brain gene expression profiles. Proc Natl Acad Sci USA. 2004, 101: 2957-2962. 10.1073/pnas.0308725100.PubMedPubMed CentralView ArticleGoogle Scholar
- Somel M, Franz H, Yan Z, Lorenc A, Guo S, Giger T, Kelso J, Nickel B, Dannemann M, Bahn S, Webster MJ, Weickert CS, Lachmann M, Paabo S, Khaitovich P: Transcriptional neoteny in the human brain. Proc Natl Acad Sci USA. 2009, 106: 5743-5748. 10.1073/pnas.0900544106.PubMedPubMed CentralView ArticleGoogle Scholar
- Karaman MW, Houck ML, Chemnick LG, Nagpal S, Chawannakul D, Sudano D, Pike BL, Ho VV, Ryder OA, Hacia JG: Comparative analysis of gene-expression patterns in human and African great ape cultured fibroblasts. Genome Res. 2003, 13: 1619-1630. 10.1101/gr.1289803.PubMedPubMed CentralView ArticleGoogle Scholar
- Khaitovich P, Kelso J, Franz H, Visagie J, Giger T, Joerchel S, Petzold E, Green RE, Lachmann M, Paabo S: Functionality of intergenic transcription: an evolutionary comparison. PLoS Genet. 2006, 2: e171-10.1371/journal.pgen.0020171.PubMedPubMed CentralView ArticleGoogle Scholar
- Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP: Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res. 2005, 15: 674-680. 10.1101/gr.3335705.PubMedPubMed CentralView ArticleGoogle Scholar
- Blekhman R, Oshlack A, Chabot AE, Smyth GK, Gilad Y: Gene regulation in primates evolves under tissue-specific selection pressures. PLoS Genet. 2008, 4: e1000271-10.1371/journal.pgen.1000271.PubMedPubMed CentralView ArticleGoogle Scholar
- Sakate R, Osada N, Hida M, Sugano S, Hayasaka I, Shimohira N, Yanagi S, Suto Y, Hashimoto K, Hirai M: Analysis of 5'-end sequences of chimpanzee cDNAs. Genome Res. 2003, 13: 1022-1026. 10.1101/gr.783103.PubMedPubMed CentralView ArticleGoogle Scholar
- Sakate R, Suto Y, Imanishi T, Tanoue T, Hida M, Hayasaka I, Kusuda J, Gojobori T, Hashimoto K, Hirai M: Mapping of chimpanzee full-length cDNAs onto the human genome unveils large potential divergence of the transcriptome. Gene. 2007, 399: 1-10. 10.1016/j.gene.2007.04.013.PubMedView ArticleGoogle Scholar
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.PubMedPubMed CentralView ArticleGoogle Scholar
- Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008, 453: 1239-1243. 10.1038/nature07002.PubMedView ArticleGoogle Scholar
- Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH: Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009, 19: 657-666. 10.1101/gr.088112.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.PubMedView ArticleGoogle Scholar
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.PubMedView ArticleGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470-476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
- Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y: Sex-specific and lineage-specific alternative splicing in primates. Genome Res. 2009, 20: 180-189. 10.1101/gr.099226.109.PubMedView ArticleGoogle Scholar
- Babbitt C, Fedrigo O, Pfefferle A, Boyle A, Horvatth J, Furey T, Wray G: Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain. Genome Biol Evol. 2010, 2010: 67-79. 10.1093/gbe/evq002.View ArticleGoogle Scholar
- t Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, Boer JM, van Ommen GJ, den Dunnen JT: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res. 2008, 36: e141-10.1093/nar/gkn705.View ArticleGoogle Scholar
- Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA: Next-generation tag sequencing for cancer gene expression profiling. Genome Res. 2009, 19: 1825-1835. 10.1101/gr.094482.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Brent MR: How does eukaryotic gene prediction work?. Nat Biotechnol. 2007, 25: 883-885. 10.1038/nbt0807-883.PubMedView ArticleGoogle Scholar
- Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.PubMedPubMed CentralView ArticleGoogle Scholar
- Bainbridge MN, Warren RL, Hirst M, Romanuik T, Zeng T, Go A, Delaney A, Griffith M, Hickenbotham M, Magrini V, Mardis ER, Sadar MD, Siddiqui AS, Marra MA, Jones SJ: Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics. 2006, 7: 246-10.1186/1471-2164-7-246.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35: D61-65. 10.1093/nar/gkl842.PubMedPubMed CentralView ArticleGoogle Scholar
- Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, Dreszer TR, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 2010, 38: D613-619. 10.1093/nar/gkp939.PubMedPubMed CentralView ArticleGoogle Scholar
- The UCSC Genome Browser. [http://genome.ucsc.edu/]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.PubMedView ArticleGoogle Scholar
- Ameur A, Wetterbom A, Feuk L, Gyllensten U: Global and unbiased detection of splice junctions from RNA-seq data. Genome Biol. 11: R34-10.1186/gb-2010-11-3-r34.
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007, 316: 1484-1488. 10.1126/science.1138341.PubMedView ArticleGoogle Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.PubMedView ArticleGoogle Scholar
- Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR: Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005, 308: 1149-1154. 10.1126/science.1108625.PubMedView ArticleGoogle Scholar
- Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest AR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, et al: The transcriptional landscape of the mammalian genome. Science. 2005, 309: 1559-1563. 10.1126/science.1112014.PubMedView ArticleGoogle Scholar
- Gross SS, Brent MR: Using multiple alignments to improve gene prediction. J Comput Biol. 2006, 13: 379-393. 10.1089/cmb.2006.13.379.PubMedView ArticleGoogle Scholar
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2010, 38: D46-51. 10.1093/nar/gkp1024.PubMedPubMed CentralView ArticleGoogle Scholar
- GenBank. [http://www.ncbi.nlm.nih.gov/genbank/]
- van Bakel H, Nislow C, Blencowe BJ, Hughes TR: Most "dark matter" transcripts are associated with known genes. PLoS Biol. 2010, 8: e1000371-10.1371/journal.pbio.1000371.PubMedPubMed CentralView ArticleGoogle Scholar
- Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE, Guigo R: GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006, 7 Suppl 1: S4.1-S4.9. 10.1186/gb-2006-7-s1-s4.Google Scholar
- Cheng Z, Ventura M, She X, Khaitovich P, Graves T, Osoegawa K, Church D, DeJong P, Wilson RK, Paabo S, Rocchi M, Eichler EE: A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature. 2005, 437: 88-93. 10.1038/nature04000.PubMedView ArticleGoogle Scholar
- Ast G: The alternative genome. Sci Am. 2005, 292: 40-47. 10.1038/scientificamerican0405-58.PubMedView ArticleGoogle Scholar
- Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57. 10.1038/nprot.2008.211.PubMedView ArticleGoogle Scholar
- SplitSeek. [http://solidsoftwaretools.com/gf/project/splitseek/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.