TRAPID: an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes
© Van Bel et al.; licensee BioMed Central Ltd. 2013
Received: 2 September 2013
Accepted: 13 December 2013
Published: 13 December 2013
Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regard to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection and frameshift correction, and includes a functional, comparative and phylogenetic toolbox that makes use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveal the efficiency and unique features of the TRAPID system. TRAPID is freely available at http://bioinformatics.psb.ugent.be/webtools/trapid/.
Technological advances in sequencing have made it possible to rapidly and cost-effectively take a snapshot of gene expression in a specific tissue or condition and have led to an explosion of transcriptome RNA-Seq data. With the Petabase barrier having been reached at the NCBI Short Read Archive (SRA) database at the end of 2012, new approaches to deal with this surge in data quantity are required. For the plant kingdom alone, more than 4,200 transcriptome experiments covering more than 390 species are available at the SRA. Over 90% of these species do not have an available draft or complete genome sequence, making the data processing and biological interpretation a challenging task. If a reference genome is available, the short reads can be processed using alignment-first (or align-then-assemble) methods that provide a genome-guided approach to study splice site junctions, identify new or alternative transcripts, or quantify expression levels using known gene annotations. In contrast, for species without a reference genome, assemble-then-align methods require that the millions of reads are first processed using de novo assembly before the reconstructed transcriptome is further characterized. Examples of downstream analysis include the remapping of the input sequence reads from the different libraries to the assembled transcripts to quantify expression levels, the remapping of all reads to assess the genetic diversity within a genotype, or the alignment of the assembled transcripts against genome or transcript sequences from closely related species.
The development and improvement of de novo transcript assembly tools is an active research field and algorithms like OASES/Velvet, Trans-ABySS, and SOAPdenovo [3–6] provide efficient tools to reconstruct transcriptomes for non-model species starting from raw sequence reads. Although both library normalization and increased sequencing depth (or higher coverage) have a positive influence on the completeness of a transcriptome, most de novo transcriptome studies present gene catalogues in which the number of transcripts after the assembly phase exceeds the estimated number of genes. This pattern mainly results from redundancy caused by the presence of partial, unassembled, or highly heterozygous sequences. Despite these imperfections, de novo transcriptomes provide a sequence backbone for various non-model species and, in line with traditional genome projects, the detailed annotation of these transcript sequences is an essential step for downstream biological analysis.
Although the workflow to process transcriptome data is highly dependent on the type of analysis, functional annotation for the assembled transcripts is often generated using sequence similarity searches against a reference database. Clearly, the default application of large-scale sequence similarity searches against databases like NCBI or UniProtKB, which contain annotated proteins, drastically increases the amount of data that needs to be interpreted to derive functional annotations. Currently, systems like KEGG Automatic Annotation Server (KAAS), Blast2GO, and T-ACE provide tools for non-expert users to perform functional characterization of transcript sequences, but both the throughput and the quality of the reference datasets are important factors influencing the biological knowledge that can be extracted from non-model transcriptomes. Whereas systems like KAAS and Blast2GO can be operated through an Internet browser, T-ACE requires the installation of a PostgreSQL database on local hardware. Although both Blast2GO and T-ACE can derive functional annotations from a BLAST search against NCBI or through protein domain identification using InterProScan, the associated runtimes grow rapidly, hindering the efficient processing of a complete transcriptome dataset. Furthermore, the quality of the functional annotations of known sequences as well as the number of species or genes included in reference databases will have an impact on the success of translating transcript sequences into functional gene catalogs. Tools that apply the Gene Ontology (GO) controlled vocabulary benefit from the different functional levels embedded in the ontology structure, while systems like KEGG Orthology provide detailed information but only for a limited number of genes.
Apart from functional annotations, the analysis of transcripts from non-model species using comparative genomics can also generate valuable information about conserved pathways, gene family expansions, species-specific genes, and genetic diversity [12–15]. However, performing such evolutionary analyses for thousands of transcripts is computationally expensive and user-friendly interfaces to compare de novo transcriptomes with high-quality reference genomes are still missing.
To address some of the issues inherent to the analysis of de novo transcriptomes, we present TRAPID, a web-based and high-throughput analysis pipeline that uses predefined reference databases. Available analyses include the automatic identification of coding sequences in transcripts, the correction of frameshifts, the assignment of coding sequences to multi-species gene families, transcript quality control, and the generation of functional annotations. Furthermore, detailed multiple sequence alignments and phylogenetic trees can easily be generated, providing a comparative framework for the analysis of non-model transcriptomes. Finally, quantitative comparisons can be performed to study functional biases in transcriptome subsets derived from different tissues or conditions.
General properties of the TRAPID transcriptome analysis tool
Table: Overview and content of the TRAPID reference databases. Surviving entries: Gene family information; OrthoMCL-DB version 5; Gene Ontology, InterPro domains; Viridiplantae (green plants); TribeMCL clustering + integrative orthologs.
The output of the sequence similarity searches is used to assign each transcript to a predefined gene family and to generate frame statistics to subsequently perform ORF detection. By default these frame statistics are submitted to a simple routine that extracts the associated longest ORF within the frame showing similarity with reference proteins (see Methods). However, this information is also used to predict whether specific transcripts contain putative frameshifts, which can, at a later stage and through the website, be automatically corrected using FrameDP, a self-training tool to predict peptide sequences in mature mRNA sequences. The association of a transcript with a specific gene family is also used to facilitate the transfer of functional consensus Gene Ontology and protein domain information to transcripts. Finally, meta-information with regard to the length of the ORF of a transcript is generated by comparing the ORF’s length to the average coding sequence length of the genes in the reference gene family.
Whereas the evolutionary analyses are based on predefined gene families from either OrthoMCL-DB or PLAZA, in some cases these families contain multiple out-paralogous sub-types (sub-clades within a family originating from an ancient gene duplication event predating most speciation events in the tree). As a consequence, some transcripts will be assigned to big gene families covering multiple genes, making phylogenetic analysis difficult. Therefore, when a single-species Reference Proteome is selected (Figure 1), it is possible to first assign transcripts to individual reference genes, for example, from a closely related model species, and in a second phase build custom gene families through the inclusion of PLAZA integrative orthologs. These orthologs were identified using an ensemble method combining OrthoMCL, reconciled phylogenetic trees, colinearity information, and multispecies best hits and inparalogs (BHI) families, including inparalogs. In contrast to homologous gene families, families based on integrative orthology will contain a smaller number of genes, cover fewer outparalogs, and thus make downstream comparative analyses more feasible and the interpretation of complex families easier. Optionally, the user can also discard some species within a specific gene family in order to reduce the number of proteins before executing the phylogenetic tree construction routine. However, it is advisable to include as many species as possible in order to maintain a good taxon sampling and reduce phylogenetic error.
Apart from the functional annotation of individual transcripts, TRAPID also supports the quantitative analysis of experiment subsets using GO and protein domain enrichment statistics. Through the association of specific labels to sets of sequences, transcripts can be annotated with specific sample information (for example, tissue, developmental stage, control, or treatment condition) and be used to perform within-transcriptome functional analysis. Based on the integrated functional transcript annotation, enrichment analysis can subsequently be used to study the biological properties of specific experiment subsets or to compare the functional biases present in, for example, a treatment/control transcriptome experiment setup.
The TRAPID platform does not necessarily have to be the endpoint of a transcriptome analysis. In order to facilitate subsequent data analysis, multiple export functions have been added to the platform: all sequences, functional annotations, and phylogenetic data content can be downloaded by the user in a tab-delimited format. Collaborative work on a single transcriptome dataset is encouraged by the ability to share a TRAPID experiment between multiple users. The user who creates the experiment is considered to be the ‘owner’ (and as such has the ability to empty or remove it), and this user can share the experiment with other TRAPID users (who can browse and edit). This prevents unnecessary data replication or sharing of user credentials.
Evaluation of homology assignments
As shown in Figure 1, the first step is to assign each transcript to a predefined homologous gene family. Because transcriptome datasets for species lacking a reference genome sequence can contain more than 100,000 transcripts (with a large fraction being fragments, allelic variants, splice variants, or highly expressed non-coding genes), the efficient processing of all these transcripts is essential to provide users with results within a reasonable timeframe. Two sequence similarity tools were considered: BLASTX and RapSearch2. The transcript-to-family assignment results were compared using protein reference databases of varying size, as the size of the database also influences the total runtime. BLASTX is often used to find proteins similar to a query gene in a large database, but requires a large amount of processing time. RapSearch2 was designed to perform the same searches but for short reads, and uses more efficient data structures to significantly speed up this process. Both tools were run using 1,000 randomly selected Arabidopsis thaliana transcripts against different databases containing all proteins from all species within a specific clade, and the correct assignment of a transcript to a family was evaluated together with the running time. In all evaluations the protein sequences of Arabidopsis thaliana and Arabidopsis lyrata were excluded from the database and the known assignments of Arabidopsis thaliana gene sequences to families from the PLAZA 2.5 database were used as a gold standard. Apart from reference databases containing all proteins for a specific species or clade, the ‘Gene family representatives’ database containing 32,294 proteins was also included in the test (see Methods). Assigning a transcript to a gene family was initially done with the 10 best similarity search hits using a simple majority-voting rule (Methods and Additional file 1).
Both BLASTX and RapSearch2 assigned 87% to 98% of the transcripts to the correct gene family in all runs. For most reference databases the runtimes for RapSearch2 were approximately 10× lower compared to BLASTX, while overall, the gene family assignment quality was comparable. Increasing the reference database from one to multiple species (for example, from the Brassicales, which only contains Carica papaya, to Eudicots, covering 11 species) quickly increases the runtimes for both tools. However, better results with regard to the gene family assignment can be obtained by using a larger database. Various metrics, for example, taking only one or multiple hits into account, were evaluated to assign transcripts to families (Additional file 2). The best performance was generally achieved by considering the best hit when using species/clade reference databases and majority voting using the top five hits when using the ‘Gene family representatives’ database. To avoid overfitting (that is, a modeling error which occurs when a function or procedure leads to a good fit with the sample data but a poor fit with new data) of this method to Arabidopsis thaliana transcripts, this benchmark was repeated using Oryza sativa ssp. japonica (excluding Oryza sativa ssp. japonica and Oryza sativa ssp. indica from the databases) and Vitis vinifera (excluding Vitis vinifera from the databases), yielding similar results (Additional file 3). Although one would expect the correct assignment rate of a transcript to the corresponding gene family to decrease when the assembly quality of the input transcripts deteriorates, this is not always the case (Additional file 4). As such, even relatively short fragments of transcripts (for example, 50 to 100 nt) can be assigned to the correct gene family.
Using manual inspection of the amino acid and sequence similarity information, the user is able to modify the association between a transcript and a family in case the automatic gene family assignment is deemed incorrect.
Evaluation of ORF finding routine
In the absence of a reference genome, transcripts generated using de novo assembly of RNA-Seq reads frequently contain errors (for example, short insertions or deletions) and methods for the downstream analysis of coding sequences should be able to correct for potential frameshifts during ORF detection. Although advanced self-learning algorithms (that is, methods that train themselves based on the input data provided by the user) such as FrameDP exist to correct frameshifts during ORF prediction, running these tools on a complete RNA-Seq transcriptome on-the-fly is computationally infeasible, even using multi-core or cluster hardware systems. Therefore, we implemented and evaluated a system to first perform the detection of putative frameshifts on all input sequences and subsequently only process these frameshift-containing sequences using FrameDP. This rationale is motivated by the observation that, when running FrameDP on complete plant transcriptomes, such as Helianthus annuus and Pachysandra terminalis [30, 31], a correctable frameshift was identified in only 3% to 15% of the input sequences.
Apart from gene family assignments, the RapSearch2 output is also used to estimate whether a frameshift is expected in an input transcript. For each input transcript the best hit in the reference database is selected and all alignments between this query and hit gene are evaluated. For each alignment the frame of the transcript hit is determined and, if no frameshift is present, all alignments should report the same reading frame, which can immediately be used to extract the corresponding longest ORF (Figure 1). To evaluate this method to identify input transcripts containing frameshifts, we selected 1,000 transcripts from Arabidopsis thaliana containing no frameshifts and an equal number of genes where one insertion or deletion was artificially introduced at a random position in the coding sequence of the transcript (see Methods). Databases of various clades, each time excluding Arabidopsis thaliana and Arabidopsis lyrata, were used along with a database containing ‘Gene family representatives’ to perform similarity searches. We found that, using these alignment-based frame statistics, 72.8% of all transcripts containing a frameshift were correctly identified, with only a few (1.8%) false positives in the dataset lacking frameshifts (Additional file 5). To provide a good balance between global ORF quality and processing time, this method was integrated as the default procedure to identify frameshifts and subsequently correct them using FrameDP. The frame statistics suggest that a fraction of frameshifts will be missed, especially when they occur near the 5′ or 3′ end of the gene, because these cause only a relatively small truncation of the full-length protein. As such, the TRAPID system also provides an option for the user to run FrameDP on all transcripts within a family context.
Comparison of TRAPID with Blast2GO and KAAS
Table: Feature comparison of web-based transcript analysis platforms
Feature | Blast2GO | KAAS | TRAPID
Sequence similarity search | NCBI non-redundant database | Curated KEGG genes | OrthoMCL-DB version 5, PLAZA 2.5
Functional annotation | Gene Ontology, InterProScan, Enzyme codes, KEGG | KEGG (KEGG Orthology groups) | Gene Ontology, Protein domains (InterPro/PFAM)
Other features | advanced stand-alone graphical user interface | graphical pathway maps | ORF length meta-annotation, share experiments with other users
Table: Comparison of computation time and transcript coverage for different web-based transcript analysis platforms (table body not recovered)
Detection of functional biases in transcriptome subsets using enrichment analysis
Apart from the general characterization of a complete transcriptome using various functional annotation systems, the detailed analysis of genes expressed in specific tissues or developmental stages can provide new insights about the underlying biological processes and their regulation. Again starting from the Panicum hallii transcriptome, we analyzed a set of transcripts showing distinct expression profiles in eight tissues for functional biases. After processing all 25,392 contigs using the Oryza sativa ssp. japonica proteome as a reference and including integrative orthologs from the PLAZA 2.5 database, 16,748 (66%) transcripts were assigned to 9,860 gene families. Based on the results of expression clustering reported by Meyer and co-workers, 6,517 transcripts were tagged with a specific label (cluster 1 to 7) and GO enrichment analysis was performed for each subset. Whereas cluster 1, including transcripts with expression in stem-associated tissues, was significantly enriched for carbohydrate metabolism, cytoskeleton/cell wall organization, and shoot development (Additional file 7), seed-specific transcripts (cluster 5) included genes involved in the generation of precursor metabolites and energy, wax metabolism, and cuticle development (P value <0.05, hypergeometric distribution with Bonferroni correction). Transcripts showing differential expression in root and seedling (cluster 3) were enriched for translation, ribosome biogenesis, and rRNA metabolism, while leaf-specific expression (cluster 6) coincided with photosynthesis, energy metabolism, and multicellular organismal development, confirming previous results. Finally, application of GO queries to tissue-specific subsets allows for the identification of transcriptional regulators involved in development. For example, searching for ‘transcription factor activity’ in the root subset (cluster 4) yields 21 transcription factors showing differential expression in root, including multiple CCAAT-binding, NAM, and bZIP proteins.
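The enrichment statistic named above (hypergeometric distribution with Bonferroni correction) can be illustrated with a minimal Python sketch. It is not the TRAPID implementation: the function names, the choice of the full transcript set as background, and the decision to correct for the number of terms observed in the subset are all assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` items are sampled without replacement from
    `population` items, of which `annotated` carry the term."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    hits = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def enriched_terms(subset, background, annotations, alpha=0.05):
    """Return {term: Bonferroni-corrected p-value} for terms enriched in `subset`.

    `annotations` maps a transcript id to an iterable of GO terms (hypothetical
    input format). Assumption: p-values are multiplied by the number of terms
    tested, that is, the terms seen at least once in the subset.
    """
    background = set(background)
    subset = set(subset) & background
    bg_counts, sub_counts = {}, {}
    for transcript in background:
        for term in set(annotations.get(transcript, ())):
            bg_counts[term] = bg_counts.get(term, 0) + 1
            if transcript in subset:
                sub_counts[term] = sub_counts.get(term, 0) + 1
    n_tests = len(sub_counts)
    result = {}
    for term, k in sub_counts.items():
        p = hypergeom_tail(k, len(background), bg_counts[term], len(subset))
        p_adj = min(1.0, p * n_tests)
        if p_adj < alpha:
            result[term] = p_adj
    return result


# Toy data: ten transcripts, three of them labelled as one expression cluster.
background = [f"t{i}" for i in range(10)]
annotations = {t: ["GO:B"] for t in background}
for t in ("t0", "t1", "t2"):
    annotations[t] = ["GO:A", "GO:B"]
cluster = ["t0", "t1", "t2"]
enriched = enriched_terms(cluster, background, annotations)
```

In this toy example only the term carried exclusively by the cluster members passes the corrected threshold; the term shared by every transcript does not.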
TRAPID provides a publicly available tool to process de novo transcriptomes from animals, plants, fungi, and bacteria. This web-based application has been developed to offer a user-friendly interface to functionally characterize assembled transcript sequences and to initiate comparative genomics analyses, enabling scientists with a biological background to explore their non-model transcriptome data in an efficient and high-throughput manner.
Datasets, construction of reference protein databases, and selection of ‘gene family representatives’
The PLAZA 2.5 database was used as reference and as source for the Arabidopsis transcripts used in the benchmark experiments. The protein databases containing clade-specific content, for both the PLAZA 2.5 and OrthoMCL reference databases, were created by using NCBI Taxonomy as reference. ‘Gene family representatives’ databases were constructed according to the procedure outlined by Van Bel et al., in which for each species within a gene family a single gene is selected as representative. This is achieved by creating a graph with genes as nodes and BLAST bit scores as edges, and taking the most central and connected gene as representative. The Pachysandra terminalis dataset was retrieved from Vekemans et al., and Helianthus annuus and Aquilegia formosa x Aquilegia pubescens from TIGR Plant Transcript Assemblies. Panicum hallii transcript sequences were retrieved from Meyer et al. and contig sequences showing differential expression among tissues were isolated from Supporting Information, file S8.
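As an illustration of the representative selection described above, the following Python sketch picks one gene per species within a family. The text does not specify which centrality measure was used, so the sketch assumes weighted degree (the sum of bit scores to all other family members); the function name and input format are hypothetical.

```python
from collections import defaultdict


def pick_representatives(family_genes, bitscores):
    """Select one representative gene per species for a single gene family.

    family_genes: {gene_id: species} for all members of the family.
    bitscores:    {(gene_a, gene_b): bit score}, one entry per gene pair.
    Assumption: "most central and connected" is approximated by the highest
    summed bit score to the other family members (weighted degree).
    """
    strength = defaultdict(float)
    for (a, b), score in bitscores.items():
        if a in family_genes and b in family_genes and a != b:
            strength[a] += score
            strength[b] += score
    best = {}
    # Sorting the gene ids makes ties resolve deterministically.
    for gene in sorted(family_genes):
        species = family_genes[gene]
        if species not in best or strength[gene] > strength[best[species]]:
            best[species] = gene
    return best


family = {"g1": "speciesA", "g2": "speciesA", "g3": "speciesB"}
scores = {("g1", "g2"): 200.0, ("g1", "g3"): 100.0, ("g2", "g3"): 50.0}
representatives = pick_representatives(family, scores)
```

Other centrality measures (for example, closeness or eigenvector centrality) would fit the same description; weighted degree is used here only because it is the simplest to state.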
Similarity search, gene family assignment, and functional transfer using homology
We used RapSearch2 to search for protein hits for each query transcript (comparable to BLASTX), with a user-selectable e-value cutoff. If the selected protein database consists of either species- or clade-specific proteins, only the top protein hit is retained and the associated gene family for this protein is assigned to the transcript. If the selected protein database consists of ‘Gene family representatives’, the top five protein hits are retained, and the gene family for the transcript is selected based on majority voting (see Additional file 2). The functional annotation for each transcript is transferred from its assigned gene family, the best similarity search hit, or a combination of both, depending on the choice of the user. If the gene family is selected as the basis for the functional annotation, those GO terms and protein domains are selected that are present in 50% or more of the members of the gene family. If not a single protein hit was detected during the similarity search, no gene family and no functional annotation is assigned to the transcript.
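The two assignment rules and the 50% transfer rule described above can be sketched in Python as follows. This is an illustrative sketch, not the TRAPID code: the function names and input structures are hypothetical, and the tie-breaking rule for majority voting (the family of the better-ranked hit wins) is an assumption, since the text does not state how ties are resolved.

```python
from collections import Counter


def assign_family(hits, protein_to_family, representatives_db):
    """Assign a gene family from similarity search hits (sorted best first).

    Species/clade database: family of the top hit.
    'Gene family representatives' database: majority vote over the top five hits.
    """
    if not hits:
        return None  # no hit: no family and no functional annotation
    if not representatives_db:
        return protein_to_family.get(hits[0])
    votes = Counter(
        protein_to_family[h] for h in hits[:5] if h in protein_to_family
    )
    if not votes:
        return None
    # Assumed tie-break: Counter.most_common keeps first-seen order for equal
    # counts, so the family of the better-ranked hit wins.
    return votes.most_common(1)[0][0]


def transfer_annotation(family_members, gene_terms, min_fraction=0.5):
    """Return terms (GO terms or protein domains) present in at least
    `min_fraction` of the family members."""
    size = len(family_members)
    if size == 0:
        return set()
    counts = Counter(
        term for gene in family_members for term in set(gene_terms.get(gene, ()))
    )
    return {term for term, c in counts.items() if c / size >= min_fraction}


protein_to_family = {"p1": "F1", "p2": "F2", "p3": "F2",
                     "p4": "F2", "p5": "F1", "p6": "F1"}
hits = ["p1", "p2", "p3", "p4", "p5", "p6"]
by_vote = assign_family(hits, protein_to_family, representatives_db=True)
by_top_hit = assign_family(hits, protein_to_family, representatives_db=False)

members = ["g1", "g2", "g3", "g4"]
terms = {"g1": ["A", "C"], "g2": ["A"], "g3": ["B"], "g4": ["B"]}
consensus = transfer_annotation(members, terms)
```

In the toy data, the sixth hit is ignored by the vote because only the top five hits are retained, and term C is dropped because it occurs in only one of four family members.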
Assigning frame information, detection and correction of potential frameshifts, and meta-annotation
For each alignment with the top protein hit, the strand and frame are determined. If the same frame and strand are detected for each alignment, the longest Open Reading Frame (ORF) within this frame is stored. In case multiple alignments occurred with the target protein in different frames, the transcript was flagged as potentially containing a frameshift and the longest ORF in all possible frames was detected and retained. Using FrameDP version 1.0.3, transcripts with expected frameshifts could be corrected. As a reference database all protein-coding genes present in PLAZA 2.5 or OrthoMCL-DB version 5 were provided. FrameDP was configured to run with BLAST 2.2.17 (Expectation value: 1e-3, Open Gap Penalty: 9, Gap Extension Penalty: 2 and retaining only the 100 best hits) while the GC3 split training with three iterations was used. If the total number of selected transcripts is lower than 20, additional random transcripts are added in order to have a good background model. Other parameters were left at their default values. Test datasets that were generated to evaluate the frameshift correction procedure are available via the TRAPID FTP site.
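The frame bookkeeping described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the TRAPID routine: an ORF is taken here to be the longest stop-free stretch in a frame (the text does not define ORF boundaries), frames are numbered 1 to 3 per strand, and all function names are hypothetical. The source does not describe ORF extraction for transcripts without any hit, so that case returns no ORF.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def frame_call(alignments):
    """alignments: (strand, frame) pairs, one per alignment with the top hit."""
    frames = set(alignments)
    if not frames:
        return "no_hit", None
    if len(frames) == 1:
        return "consistent", next(iter(frames))
    return "putative_frameshift", None


def longest_orf(seq, strand, frame):
    """Longest stop-free stretch in the given strand ('+'/'-') and frame (1-3)."""
    s = seq.upper()
    if strand == "-":
        s = reverse_complement(s)
    start = best_start = best_end = frame - 1
    for i in range(frame - 1, len(s) - 2, 3):
        if s[i:i + 3] in STOPS:
            start = i + 3  # next candidate begins after the stop codon
        elif i + 3 - start > best_end - best_start:
            best_start, best_end = start, i + 3
    return s[best_start:best_end]


def orf_for_transcript(seq, alignments):
    status, call = frame_call(alignments)
    if status == "consistent":
        return status, longest_orf(seq, *call)
    if status == "putative_frameshift":
        # Flagged transcripts: keep the longest ORF over all six frames.
        candidates = [longest_orf(seq, st, fr) for st in "+-" for fr in (1, 2, 3)]
        return status, max(candidates, key=len)
    return status, None


transcript = "ATGAAATAGCCCGGGTTTAAA"
consistent = orf_for_transcript(transcript, [("+", 1), ("+", 1)])
flagged = orf_for_transcript(transcript, [("+", 1), ("+", 3)])
```

Transcripts that come back as `putative_frameshift` are the ones that would subsequently be handed to FrameDP for correction.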
The meta-annotation for all transcripts is determined by comparing the transcript length to the lengths of the coding sequences which constitute its associated gene family. In case no gene family was assigned to the transcript, or in case the associated gene family comprises fewer than five proteins, the transcript receives the label ‘No Information’ as meta-annotation. Otherwise, the lengths of the coding sequences from the gene family are ordered, and the longest 10% and shortest 10% are removed in order to reduce potential outliers within the reference data. Using the remaining lengths, the average and standard deviation are computed. If the transcript length is shorter than the average minus two standard deviations, the transcript receives the label ‘Partial’ as meta-annotation. If the transcript is longer, it receives the label ‘Quasi Full Length’ as meta-annotation. In case a transcript has meta-annotation ‘Quasi Full Length’, and its associated ORF has both a start and stop codon, then the meta-annotation is changed to ‘Full Length’.
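The labelling rules above translate almost directly into code. The following Python sketch is illustrative only; the function name is hypothetical, and two details that the text leaves open are assumptions: the number of values trimmed at each end is rounded down, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def meta_annotation(transcript_len, family_cds_lengths, has_start, has_stop):
    """Label a transcript by comparing its length with its gene family."""
    if not family_cds_lengths or len(family_cds_lengths) < 5:
        return "No Information"
    lengths = sorted(family_cds_lengths)
    k = int(len(lengths) * 0.10)  # assumed rounding: drop floor(10%) per end
    trimmed = lengths[k:len(lengths) - k]
    avg = mean(trimmed)
    sd = stdev(trimmed)  # assumption: sample standard deviation
    if transcript_len < avg - 2 * sd:
        return "Partial"
    if has_start and has_stop:
        return "Full Length"
    return "Quasi Full Length"


family = [900, 950, 1000, 1000, 1000, 1000, 1000, 1050, 1100, 3000]
labels = [
    meta_annotation(500, family, True, True),
    meta_annotation(1000, family, True, True),
    meta_annotation(1000, family, True, False),
    meta_annotation(1000, family[:4], True, True),
]
```

In the toy family the 3,000 nt outlier is trimmed before the mean and standard deviation are computed, so it does not inflate the threshold below which a transcript is called ‘Partial’.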
Multiple sequence alignments and phylogenetic trees
Using MUSCLE, the translated Coding Sequences (CDS) from transcripts belonging to the same gene family were aligned with amino acid sequences of homologous genes present in the reference database. MUSCLE provides a good balance between speed and accuracy and, in order to reduce the computation time for big gene families, the maximum number of iterations in the MUSCLE algorithm is fixed at three (all other settings are left at default). When building a phylogenetic tree, this multiple sequence alignment was edited following the same procedure as outlined in Proost et al., where alignment columns containing gaps were removed when a gap was present in more than 10% of the sequences (stringent editing), as well as additional positions left and right of the gap. In case the stringent editing yields a stripped alignment with zero or only a few conserved alignment positions, the user can re-run the analysis using the relaxed editing option (gaps removed when present in at least 25% of the sequences). In addition, sequences flagged as ‘Partial’ can easily be discarded when performing phylogenetic tree construction. From this alignment phylogenetic trees can be generated using FastTree2 or PhyML, using the following parameters for protein sequences: ‘-wag’ and ‘-gamma’ for FastTree; for PhyML, the WAG substitution model, empirical amino acid frequencies, the default number of four relative substitution rate categories, and a maximum likelihood estimate of the gamma shape parameter. For both methods the number of bootstrap samples can be specified by the user. The Newick format of the trees was converted to PhyloXML to allow the dynamic coloring of proteins within the phylogenetic tree.
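The alignment editing step can be illustrated with a small Python sketch that removes columns based on their gap fraction. It is a simplification under stated assumptions: the removal of additional positions flanking each gap is not modelled because the number of flanking positions is not given in the text, and the function name and parameters are hypothetical.

```python
def strip_gapped_columns(alignment, threshold, inclusive=False):
    """Remove alignment columns whose gap fraction exceeds `threshold`.

    alignment: {sequence_name: aligned sequence}, all of equal length.
    inclusive: if True, columns with a gap fraction equal to the threshold
               are removed as well.
    """
    names = list(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_seqs = len(names)
    keep = []
    for col in range(lengths.pop()):
        gap_fraction = sum(alignment[n][col] == "-" for n in names) / n_seqs
        drop = gap_fraction >= threshold if inclusive else gap_fraction > threshold
        if not drop:
            keep.append(col)
    return {n: "".join(alignment[n][c] for c in keep) for n in names}


aln = {"a": "MK-LV", "b": "MKALV", "c": "MKALV", "d": "MKALV", "e": "MKALV"}
# Stringent editing: gap in more than 10% of the sequences.
stringent = strip_gapped_columns(aln, 0.10)
# Relaxed editing: gap in at least 25% of the sequences.
relaxed = strip_gapped_columns(aln, 0.25, inclusive=True)
```

With one gapped sequence out of five (20%), the third column is removed under stringent editing but retained under the relaxed option.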
Implementation and availability
The TRAPID tool is available online. The source code is also available online [39, 40]. To install the software locally, users need to download and install third-party software, as described in the TRAPID README file. General documentation (Additional file 8), a tutorial (Additional file 9) and an example dataset are available on the TRAPID website.
KAAS: KEGG Automatic Annotation Server
ORF: Open Reading Frame
SRA: Short Read Archive
We thank Eli Meyer for providing us with the Panicum transcript dataset, Annick Bleys for help in preparing the manuscript and Bram Slabbinck for useful suggestions. KV and YVDP acknowledge the Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’ Project (no 01MR0310W) of Ghent University.
- Kodama Y, Shumway M, Leinonen R: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40: D54-D56. 10.1093/nar/gkr854.PubMedPubMed CentralView ArticleGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8: 469-477. 10.1038/nmeth.1613.PubMedView ArticleGoogle Scholar
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.PubMedView ArticleGoogle Scholar
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25: 1966-1967. 10.1093/bioinformatics/btp336.PubMedView ArticleGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJ, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.PubMedView ArticleGoogle Scholar
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.PubMedPubMed CentralView ArticleGoogle Scholar
- Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho LP, Hu Y, Carlson JE, Ma H, Schuster SC, Soltis DE, Soltis PS, Altman N, dePamphilis CW: Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009, 10: 347-10.1186/1471-2164-10-347.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao QY, Wang Y, Kong YM, Luo D, Li X, Hao P: Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinformatics. 2011, 12: S2-View ArticleGoogle Scholar
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007, 35: W182-W185. 10.1093/nar/gkm321.PubMedPubMed CentralView ArticleGoogle Scholar
- Gotz S, Arnold R, Sebastian-Leon P, Martin-Rodriguez S, Tischler P, Jehl MA, Dopazo J, Rattei T, Conesa A: B2G-FAR, a species-centered GO annotation repository. Bioinformatics. 2011, 27: 919-924. 10.1093/bioinformatics/btr059.PubMedPubMed CentralView ArticleGoogle Scholar
- Philipp EE, Kraemer L, Mountfort D, Schilhabel M, Schreiber S, Rosenstiel P: The Transcriptome Analysis and Comparison Explorer–T-ACE: a platform-independent, graphical tool to process large RNAseq datasets of non-model organisms. Bioinformatics. 2012, 28: 777-783. 10.1093/bioinformatics/bts056.PubMedView ArticleGoogle Scholar
- Baldo L, Santos ME, Salzburger W: Comparative transcriptomics of Eastern African cichlid fishes shows signs of positive selection and a large contribution of untranslated regions to genetic diversity. Genome Biol Evol. 2011, 3: 443-455. 10.1093/gbe/evr047.PubMedPubMed CentralView ArticleGoogle Scholar
- Poelchau MF, Reynolds JA, Denlinger DL, Elsik CG, Armbruster PA: A de novo transcriptome of the Asian tiger mosquito, Aedes albopictus, to identify candidate transcripts for diapause preparation. BMC Genomics. 2011, 12: 619-10.1186/1471-2164-12-619.PubMedPubMed CentralView ArticleGoogle Scholar
- Tzika AC, Helaers R, Schramm G, Milinkovitch MC: Reptilian-transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles. Evodevo. 2012, 2: 19-View ArticleGoogle Scholar
- Vandepoele K, Van de Peer Y: Exploring the plant transcriptome through phylogenetic profiling. Plant Physiol. 2005, 137: 31-42. 10.1104/pp.104.054700.
- Zhao Y, Tang H, Ye Y: RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012, 28: 125-126. 10.1093/bioinformatics/btr595.
- Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K: Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 2012, 158: 590-600. 10.1104/pp.111.189514.
- Chen F, Mackey AJ, Stoeckert CJ, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, 34: D363-D368. 10.1093/nar/gkj123.
- Chen F, Mackey AJ, Vermunt JK, Roos DS: Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One. 2007, 2: e383. 10.1371/journal.pone.0000383.
- Martinez M: Plant protein-coding gene families: emerging bioinformatics approaches. Trends Plant Sci. 2011, 16: 558-567. 10.1016/j.tplants.2011.06.003.
- Proost S, Van Bel M, Sterck L, Billiau K, Van Parys T, Van de Peer Y, Vandepoele K: PLAZA: a comparative genomics resource to study gene and genome evolution in plants. Plant Cell. 2009, 21: 3718-3731. 10.1105/tpc.109.071506.
- Gouzy J, Carrere S, Schiex T: FrameDP: sensitive peptide detection on noisy matured sequences. Bioinformatics. 2009, 25: 670-671. 10.1093/bioinformatics/btp024.
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.
- Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ: Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009, 25: 1189-1191. 10.1093/bioinformatics/btp033.
- Price MN, Dehal PS, Arkin AP: FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One. 2010, 5: e9490. 10.1371/journal.pone.0009490.
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59: 307-321. 10.1093/sysbio/syq010.
- Zwickl DJ, Hillis DM: Increased taxon sampling greatly reduces phylogenetic error. Syst Biol. 2002, 51: 588-598. 10.1080/10635150290102339.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Wasmuth JD, Blaxter ML: prot4EST: translating expressed sequence tags from neglected genomes. BMC Bioinformatics. 2004, 5: 187. 10.1186/1471-2105-5-187.
- Vekemans D, Proost S, Vanneste K, Coenen H, Viaene T, Ruelens P, Maere S, Van de Peer Y, Geuten K: Gamma Paleohexaploidy in the Stem Lineage of Core Eudicots: Significance for MADS-Box Gene and Species Diversification. Mol Biol Evol. 2012, 29: 3793-3806. 10.1093/molbev/mss183.
- Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP: The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 2007, 35: D846-D851. 10.1093/nar/gkl785.
- Meyer E, Logan TL, Juenger TE: Transcriptome analysis and gene expression atlas for Panicum hallii var. filipes, a diploid model for biofuel research. Plant J. 2012, 70: 879-890. 10.1111/j.1365-313X.2012.04938.x.
- Federhen S: The NCBI Taxonomy database. Nucleic Acids Res. 2012, 40: D136-D143. 10.1093/nar/gkr1178.
- FTP TRAPID. [ftp://ftp.psb.ugent.be/pub/trapid/]
- Han MV, Zmasek CM: phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics. 2009, 10: 356. 10.1186/1471-2105-10-356.
- CakePHP official website. [http://www.cakephp.org]
- MySQL official website. [http://www.mysql.com/]
- URL TRAPID. [http://bioinformatics.psb.ugent.be/webtools/trapid]
- TRAPID source code on GitHub. [https://github.com/CIGUGent/TRAPID/tree/master/src]
- TRAPID source code on TRAPID FTP. [ftp://ftp.psb.ugent.be/pub/trapid/src/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.