A time course of planarian head regeneration
To study the dynamic changes in gene expression during head regeneration, we collected samples at 16 different time points between 30 minutes and 3 days after head amputation, as well as two control samples frozen immediately after decapitation. To facilitate the detection of genes expressed in or proximal to the blastema, we extracted mRNA specifically from anterior pre-pharyngeal tissue rather than from whole animals (Figure 1a). We prepared seven fragmented cDNA libraries for 2 × 36-bp paired-end sequencing, each including material from two to four pooled samples (Figure 1b). Sequencing on an Illumina Genome Analyzer II yielded more than 336 million raw reads (168 million read pairs), of which 274 million (81.5%) could be mapped to supercontigs of the preliminary S. mediterranea genome assembly using Tophat [14]. While good correspondence between MAKER gene predictions and Illumina read coverage was observed in some cases (Figure 1c), we frequently detected transcription from genomic regions lacking annotation (Figure 1d). To identify differentially expressed loci independent of prior gene annotation, we therefore used our short-read transcriptome sequencing (RNAseq) data to assemble the expressed transcriptome de novo.
De novoassembly with Velvet and Oases
The assembly of transcripts needs to account for alternative splicing events as well as post-transcriptional sequence modifications, for example, poly-adenylation of RNAs. After filtering the dataset for low base-calling quality, we employed a two-step strategy to assemble the remaining 318 million (94.6%) high quality reads: we first generated a preliminary assembly using Velvet [15], incorporating 187 million reads (58.8%), followed by the construction of transcripts by Oases [16]. We obtained 26,018 transcripts, corresponding to 18,780 non-overlapping sequences (Figure 2a; Illumina) with a minimum length of 200 bp.
To assess the quality of this assembly, we first compared it to results obtained with a complementary sequencing technique, Roche 454 sequencing. Recently, two independent studies generated 454 sequence datasets from different stages and tissues of S. mediterranea [17, 18]. To generate reference sequences for comparison with the Illumina assembly, we assembled these datasets, separately or combined into a single set of 454 reads, using the isoform-aware assembler Newbler 2.5. As expected, combining reads from both 454 datasets significantly improved both individual assemblies, as judged, for example, by the improved average and maximum lengths of the assembled transcripts (N50 = 1.1 kb, longest sequences = 12.2 kb) (Figure S1a-c in Additional file 1) and an improved orthology hit ratio (Figure S1d in Additional file 1). To evaluate assembly quality, we compared our short-read de novo assembly with the assembly obtained with the combined 454 datasets ('454').
Transcriptome assemblies based on Illumina or 454 data yielded similar numbers of isogroups (non-overlapping sequences, from hereon referred to as genes) and isotigs (isogroups and their putative splice isoforms, from hereon referred to as transcripts) (Figure 2a), as well as comparable mean sequence lengths (454, 946 bp; Illumina, 1,005 bp). Yet, their length distributions differed, with the Illumina assembly being strongly skewed towards longer sequences, reflected in a high weighted median statistic (N50 = 1.5 kb) and greater maximum transcript length (16.7 kb), and the 454 assembly presenting a more symmetrical distribution with a median of 839 bp (Figure 2b), approximately twice the length of the reported raw reads (Figure S1e in Additional file 1). Both assemblies reached an average length greater than the computational set of predictions made by MAKER (mean, 796 bp; median, 624 bp).
To investigate whether the increase in average sequence length observed in our de novo assemblies was likely to reflect improved gene models rather than artifacts due to greedy assembly algorithms, we identified the closest homologs for all genes in the genome of Schistosoma mansoni, the evolutionarily closest species with a high-quality gene annotation, and determined the ortholog hit ratios for assembled or predicted sequences (Figure 2c) [19]. Both 454- and Illumina-based assemblies display similar bimodal ratio distributions: one group of genes achieved an ortholog hit ratio close to 1.0 and was therefore likely to contain near full-length sequences. A second peak, at a ratio of approximately 0.15, indicated that a roughly similarly sized group contained genes considerably shorter than their best blast homolog in S. mansoni. Most of the computational MAKER predictions fell into the latter group, highlighting the validity of the additional information available through transcriptome sequencing.
As an alternative way of assessing the quality of our assemblies, we compared the 125 full length S. mediterranea cDNA sequences available from NCBI's GenBank with their best reciprocal blat hits from each assembly. Most known genes were well represented in each assembly (Figure 2d). For example, the Illumina+ assembly contained near full-length sequences (median of 92.9% cDNA sequence recovered) for 75 (60%) of the 125 known genes.
Next, we mapped the assembled genes onto the approximately 43,000 genomic supercontigs using blat [20]. As no genomic information had been used to construct the transcriptome, an independent convergence of de novo assembled and genomic sequences would indicate a high quality of the assembly process. More than two thirds (67%) of all genes assembled from Illumina data could be matched to the genome with alignments including more than 90% of each transcript length. With the same settings, a considerably larger fraction, 79%, of the sequences assembled from 454 data could be matched (Figure 2e). By allowing alignments including only 60% of the gene length, nearly all of the sequences from the 454 assembly (96.1%) and a very large fraction of the Illumina dataset (87%) could be located on a single genomic supercontig.
The draft genome assembly is highly fragmented and 44% of all supercontigs are shorter than 10 kb (median length of 11.3 kb), putting them into the same size range as the longest sequences in our de novo assemblies (Additional file 2). We therefore inspected the partial alignments of long gene loci that could not be aligned to any single supercontig. We identified 1,449 sequences (length > 1,000 bp) with non-overlapping, high-scoring alignments to different supercontigs. Of these, 413 displayed significant homology to proteins in the NCBI non-redundant protein database overlapping with the putative supercontig junctions, lending independent support to the validity of our de novo assemblies (Additional file 3; see Materials and methods for details). For example, four transcripts from Gene_1033 up to 13.4 kb long were assembled from Illumina short-read data, while only short fragments of these transcripts could be assembled using the 454 datasets. The transcripts could be aligned to supercontig v31.000152 at their 5' ends (6.2 kb, 47% of the longest transcript), but matched supercontig v31.005068 at their 3' ends (6.1 kb, 45.5% of the longest transcript) with more than 99% identity between cDNA and genome sequence (Figure 3).
The closest mouse ortholog, the Hy-3 gene, is homologous to both the 5' and 3' end of the transcripts' sequences and aligns to the same genomic regions, pointing towards the physical continuity of these two supercontigs. We tested this hypothesis experimentally and confirmed it by PCR amplification and Sanger sequencing of a supercontig-spanning sequence (Figure 3; Additional file 4). This exemplifies the potential for de novo transcriptome assemblies to aid in refining the S. mediterranea genome, similar to a recent case study performed on the Caenorhabditis elegans genome [21]. Closer inspection of the alignment of mouse Hy-3 also revealed overlap with the adjacent, separate gene identified in our assembly (Figure 3, Gene_8274), which is likely continuous with Gene_1033.
To independently verify the expression of individual genes assembled from Illumina short-read data, we picked 14 sequences for experimental validation of expression and amplicon size by RT-PCR, all of which could be detected at the expected sequence lengths, further demonstrating the accuracy of our de novo assembly (Additional file 5).
Based on these combined results, we conclude that our paired-end Illumina transcriptome assembly contained high quality, often near full-length sequences. To improve the assembly further, while maintaining the ability to assemble multiple isoforms for each gene, we repeated the Velvet/Oases assembly of the Illumina reads, this time providing the result of the 454 assembly and EST sequences obtained from GenBank as scaffolds to the algorithms. This allowed us to connect isolated clusters of assembled Illumina reads via bridging with longer sequences, while still requiring a minimum short read coverage of the final sequence. This 'assisted assembly' yielded 24,669 isotigs/transcripts, grouped into 17,465 isogroups/genes and achieved a further increase in average and maximum transcript lengths (maximum length, 17.6 kb; N50, 1.6 kb; Figure 2a). We therefore chose this dataset, from hereon referred to as 'Illumina+', for further characterization and the identification of differentially regulated genes.
To provide a preliminary annotation of the Illumina+ assembly, we performed a blastx search of the longest transcript from each gene against the NCBI non-redundant (nr) protein database and identified homologous protein sequences for 10,112 out of 17,465 genes (57.9%, e-value cutoff of 10-3). Focusing on the highest scoring match for each sequence revealed that the largest number of top hits originated from S. mansoni (2,015 hits, 19.9%) or Schistosoma japonicum (845 hits, 8.4%) (Figure 4; Additional file 6), trematode parasites from the Platyhelminthes phylum. Next, we used internally consistent Gene Ontology (GO) annotations from the top 20 blast hits to provide a preliminary functional annotation to the de novo assembly with the Blast2GO tool [22] and obtained predictions for 6,678 (66.0%) of the genes with significant blast hits. Among the most frequently assigned 'GO biological process' terms were, for example, signal transduction (GO:0007165), response to stress (GO:0006950) and cell differentiation (GO:0030154), indicating that our assembly captured at least part of the regulatory kit of planarian cells (Figure 4b; Additional file 7).
Dynamic gene expression during head regeneration
To identify genes dynamically regulated in response to tissue loss and during early head regeneration, we mapped the reads from each of the seven Illumina libraries, composed of samples collected from two control samples (0 h) and between 0.5 and 1 h, 2 to 3 h, 4 to 8 h, 10 to 18 h or 24 to 72 h post-amputation, respectively, to the Illumina+ assembly using bowtie [23]. On average, 57.9% of all reads could be mapped uniquely to the assembly, corresponding to a total of between 11.7 and 16.3 million counts for each library (Additional file 8).
To compare across different samples, we normalized the data to account for differences in the total number of reads per library and identified genes differentially expressed compared to the control samples (time point 0; see Materials and methods section for details; Figure 5a; Additional files 7 and 9) [24, 25]. We identified 1,143 significantly regulated loci (adjusted P-value < 0.001 and a log2 fold change of ± 0.7 at one or more time points), with many genes displaying highly dynamic expression patterns during the recorded time course (Figures 5a and 6a, b; Additional file 9).
To verify the accuracy of our global transcription profiling results, we selected ten genes, both differentially expressed and negative controls, for in-depth validation by quantitative RT-PCR (qRT-PCR; Figure 5b). The analysis of biologically and technically independent time courses covering the first 24 h of regeneration revealed a very high concordance between RNAseq and qRT-PCR. For example, both approaches revealed strong upregulation of Gene_5777 and Gene_3164 as early as 1 h after amputation (Figure 5b; Additional files 7 and 9), with expression levels decreasing again by 6 h. In addition, the induction of Gene_17538 was detected between 3 and 6 h of regeneration, with both methods (Figure 5b; Additional files 7 and 9). These highly dynamic temporal expression profiles were also detected in an independent SOLiD RNAseq experiment recording changes after head and tail amputation (see below). This high reproducibility of expression profiles between independent technologies demonstrates the power of RNAseq for differential expression analysis of biological samples.
We next extended our analysis to the full Illumina time course to study global patterns of gene expression during regeneration. At each time point, unique as well as overlapping sets of differentially expressed genes were detected (Figure 6a, b). For example, out of 249 significantly regulated sequences detected at the 0.5 to 1 h time point, 111 (44.6%) were also detected as differentially regulated at one or more of the following time points, lending further support to each individual observation (adjusted P-value < 0.001 and a log2 fold change of ± 0.7; Figure 6a). To reveal the underlying regulatory dynamics, we transformed the expression changes of all significant genes to z-scores and applied an unsupervised learning technique, k-means clustering, and identified five distinct temporal classes (Figure 6c). Cluster 1 featured genes that showed early and sustained induction throughout the time course of head regeneration (indicated by a steady transition from negative to positive z-scores). Cluster 2 contained genes that were upregulated rapidly and transiently within the first 8 h of regeneration, when wound healing, immune responses and stem cell proliferation take place. Genes in cluster 3 were upregulated after 10 h of regeneration, when the blastema forms. At this stage, genes grouped into clusters 4 and 5 began to show decreased expression, with strongest down-regulation detected between 10 and 18 or 24 and 72 h, respectively, when cells undergo migration, differentiation, and patterning processes.
Of the total 1,142 differentially expressed genes clustered into the 5 temporal classes, 273 (23.9%) were associated with GO slim annotation terms, allowing us to search for functional categories significantly over- or under-represented among genes with dynamic expression patterns compared to the full set of loci (Additional file 10). Despite the sparsity of the preliminary annotation, we detected a significant overrepresentation of putative serine-type endopeptidases in cluster 5 and their putative inhibitors in cluster 3. In contrast, we found cluster 4 to be specifically enriched in regulators of synaptic transmission and putative membrane transporters.
To validate our findings and collect additional data during the early stages of regeneration, we performed a second independent RNAseq experiment aimed at identifying differential gene expression after amputation of both head and tail regions from strand-specific RNAseq libraries using SOLiD technology. We collected samples from planarians regenerating both head and tail regions at 0 h (control), 1 h, and 6 h after amputation, and prepared strand-specific whole transcriptome sequencing libraries from polyA RNA (Figure 7a, b). For each sample, we obtained between 59 and 64 million raw reads using SOLiD V3 chemistry. Of these reads, 46% could be mapped to the Illumina+ transcriptome (Additional file 8), demonstrating that our assembly provided sufficient coverage to serve as a reference for the analysis of data from other experimental samples or sequencing methodologies. Mapping the strand-specific SOLiD data revealed that most genes (15,739 out of 17,459, 90.1%) showed strong strand bias, with more than ten times more reads mapping to the forward than the reverse strand or vice versa. This provided further support for the success of our de novo assembly strategy and allowed us to discern the direction of transcription for the majority of loci.
An EGR-like transcription factor is induced early in response to injury
Gene_5777 is within the group of early up-regulated genes in cluster 2 of the Illumina RNAseq time course (Figure 6c). Its expression was strongly induced within the first hour after decapitation (log2FC > 3), and dropped to low levels between 3 and 6 h, an expression change that was confirmed by SOLiD sequencing (Figure 7c, d). Sequence analysis revealed that Gene_5777 maps to genomic contig v31.019596 and encodes a putative EGR transcription factor, which we called Smed-EGR-like1 (accession number [GenBank: JF914965]). To test where in the animal smed-egr-like1 was expressed, we performed whole mount in situ hybridization of intact and regenerating animals. We did not detect smed-egr-like1 mRNA in either intact animals or in animals after 6 or 24 h of regeneration (Figure 7d-f, and data not shown), but the gene was strongly expressed in broad domains at both anterior and posterior blastemas 1 h after the cut (Figure 7f). To evaluate whether this up-regulation was caused by the loss of tissue (the result of a complete transverse cut through the animal) or in response to wounding alone, we performed a complementary experiment by creating small incisions in otherwise intact animals (schematic shown in Figure 7e). As early as 1 h after injury, smed-egr-like1 was strongly induced around the sites of incision, including cells located several cell diameters away from the wound (Figure 7f). These experiments validated the results of our RNAseq analysis and indicated that smed-egr-like1 is expressed in response to injury, potentially in response to a signal rapidly spreading from the site of tissue damage.
A Runx transcription factor is required for proper regeneration of the visual system
In contrast to smed-egr-like1, the amputation-induced expression of Gene_17538 was sustained for at least 3 days (Figure 6c, cluster 1). Consistent with our SOLiD and Illumina RNAseq data, the Gene_17538 transcript was undetectable in non-regenerating animals or within the first hour after injury (Figures 7g and 8b). However, at 6 h after decapitation, Gene_17538 mRNA was detected in laterally enriched, discrete cells in close proximity to the cut (Figure 7g). Gene_17538 expression was stronger in the more anterior tissue of the same fragment, indicating potential sensitivity of its expression to an anterior-to-posterior gradient.
To determine how long this gene was expressed during regeneration, we repeated and extended the time course using qRT-PCR. Gene_17538 was upregulated more than 5-fold as early as 2 to 3 h after decapitation, reached maximum induction (more than 16-fold) after 6 h, and maintained a high expression level during the first 48 h of regeneration, before slowly returning to near basal levels after 1 week (Figure 8a). Sequence analysis revealed that Gene_17538 maps to supercontig v31.001002 in the planarian genome and encodes a putative Runx transcription factor. Its predicted protein product shares 49% sequence identity with the Drosophila Lozenge protein in the conserved Runt domain (Additional file 11). We named Gene_17538 smed-runt-like1 (accession number [GenBank: JF720854]), as we found at least one more putative Runx transcription factor (Gene_20170), which mapped to genomic contigs v31.007764 and v31.028733 but did not seem differentially expressed at the time points tested (Additional file 7).
After 5 days, smed-runt-like1-positive cells could no longer be detected in posterior-facing blastemas. Instead, smed-runt-like1 was expressed in a subset of cells along the regenerating brain (Figure 8c) at the time when new photoreceptor neurons connect to the brain and the animal recovers vision [9]. Interestingly, after 5 days, smed-runt-like1-positive cells appeared also in trunk-regenerating head pieces, indicating a putative role in general tissue remodeling and re-scaling of the head during trunk-regeneration (Figure 8c).
To test for a functional role for smed-runt-like1 during head regeneration, we inhibited its expression through RNAi by injecting long double-stranded RNAs (dsRNAs) into the gut of planarians. mRNA levels of smed-runt-like1 were strongly reduced, but not completely eliminated after RNAi, possibly due to very rapid kinetics of its induction after decapitation (Additional file 11). However, using two independent dsRNAs, we observed reproducible defects in eye regeneration, with phenotypes including abnormal rhabdomere morphology, mispositioned photoreceptor neurons and misrouted axons (Figures 8g-j; Additional file 11). These phenotypes are consistent with the role of lozenge during fly larval development, where this gene is expressed in photoreceptor precursor cells [26] to control cell fate decisions and patterning of the visual system through transcriptional regulation of other transcription factors, such as the homeodomain-containing protein Prospero [27]. Whether smed-runt-like1 controls the development of eyes during planarian regeneration through similar mechanisms and target genes as in the fly remains the subject of future studies.
Consistent with the smed-runt-like1 expression pattern, axon guidance phenotypes were also observed in 21% of all trunk-regenerating head fragments (Figure 8j). This suggests that the developmental programs controlling regeneration might also be active during tissue remodeling and body re-scaling.