Genome-wide profiling of mouse RNA secondary structures reveals key features of the mammalian transcriptome

Background The understanding of RNA structure is a key feature toward the comprehension of RNA functions and mechanisms of action. In particular, non-coding RNAs are thought to exert their functions by specific secondary structures, but an efficient annotation on a large scale of these structures is still missing. Results By using a novel high-throughput method, named chemical inference of RNA structures, CIRS-seq, that uses dimethyl sulfate, and N-cyclohexyl- N'-(2-morpholinoethyl)carbodiimide metho-p-toluenesulfonate to modify RNA residues in single-stranded conformation within native deproteinized RNA secondary structures, we investigate the structural features of mouse embryonic stem cell transcripts. Our analysis reveals an unexpected higher structuring of the 5′ and 3′ untranslated regions compared to the coding regions, a reduced structuring at the Kozak sequence and stop codon, and a three-nucleotide periodicity across the coding region of messenger RNAs. We also observe that ncRNAs exhibit a higher degree of structuring with respect to protein coding transcripts. Moreover, we find that the Lin28a binding protein binds selectively to RNA motifs with a strong preference toward a single stranded conformation. Conclusions This work defines for the first time the complete RNA structurome of mouse embryonic stem cells, revealing an extremely distinct RNA structural landscape. These results demonstrate that CIRS-seq constitutes an important tool for the identification of native deproteinized RNA structures. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0491-2) contains supplementary material, which is available to authorized users.

30°C for 20 minutes with moderate shaking. The reaction was stopped by the addition of 1 ml ice-cold TRIzol.

CIRS-seq library preparation
For CIRS-seq library preparation, we first prepared pools of each treatment.
The DMS sample was obtained by pooling the 50-100-150-200 mM-treated conditions, while the CMCT sample was obtained by pooling the 10-20-25-50 mM-treated conditions (~1.25 μg each). The samples obtained by treating with 0 mM DMS and 0 mM CMCT were pooled in equal amounts (~2.5 μg each) and constituted the Non-treated (NT) control. Ribosomal RNAs were depleted using the Ribo-Zero Gold Kit (Epicentre). One-third of each sample (corresponding to approximately 100 ng of Ribo-RNA) was subjected to reverse transcription with random hexamers using SuperScript II Reverse Transcriptase (Invitrogen). The reverse transcription reaction was conducted in 1 hour at 42°C, followed by 10 minutes at 70°C to inactivate the reverse transcriptase. The template RNA was then degraded by adding 10 U of Ribonuclease H (Ambion) and incubating for 20 minutes at 37°C. After ethanol precipitation of the cDNA, an adapter modified with a 5'-P group and a 3'-C3 spacer, corresponding to the reverse complement of the standard Illumina TruSeq Small RNA 5' adapter (RC5), was ligated to cDNA 3'-OH termini using 200 U of CircLigase II for 4 hours at 68°C. This approach allowed us to keep the strand-specificity of the library so that each read started 1 nt downstream of the RT stopping point. Then, to enable the ligation of a 5' adapter, the cDNA was treated with T4 Polynucleotide Kinase (NEB) in T4 DNA Ligase Buffer (NEB) for 1 hour at 37°C. After ethanol precipitation, the cDNA was loaded on a TBE-Urea 5% PAGE gel. A gel slice corresponding to 70-200 nt was cut, and the cDNA was recovered by passive diffusion into diffusion buffer (500 mM ammonium acetate, 1 mM EDTA, 10 mM magnesium acetate, 0.1% SDS) for 16 hours at 37°C, followed by ethanol precipitation. The cDNA 5' termini were then ligated to a second adapter, corresponding to the reverse complement of the standard Illumina TruSeq Small RNA 3' adapter (RC3) using 200 U of CircLigase II for 4 hours at 68°C. The adapter-ligated cDNA was then subjected to 15-18 cycles of PCR using standard Illumina TruSeq primers. To remove the adapter dimers, the library was loaded on a 3% (w/v) TBE-agarose gel, and the slice corresponding to 150-300 nt was cut and purified using the NucleoSpin Gel and PCR Clean-up Kit (Macherey-Nagel).

Transcriptome assembly and reads mapping
For reads mapping, we used a recently published mm9 reference genome assembly variant that incorporates E14 ESC SNVs [1]. Prior to mapping, the reads quality was estimated using the FastQC tool v0.10.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The nucleotide positions with a quality score below 20 (Phred33 scale) were trimmed using the fastx_trimmer tool from the FASTX Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). After low-quality position trimming, the reads in which sequencing continued through the 3' adapter sequence (TGGAATTCTCGGGTGCCAAGG) were subjected to adapter clipping using the fastx_clipper tool from the FASTX Toolkit, and the reads shorter than 35 nt were discarded. The reads were then subjected to two mapping rounds on the E14 mm9 genome reference. In the first round, the reads were mapped using Bowtie v1.0.0 [2], allowing for a maximum of 2 mismatches in the seed and allowing multiple mappings (parameters: -n 2 -a -y --best). In the second round, the unmapped reads were truncated to 35 nucleotides and mapped again with the same parameters used in the first round, except for the maximum allowed number of mismatches that was reduced to 1 (parameters: -n 1 -a -y --best). For transcriptome analysis, the full Ensembl mouse gene annotation was downloaded from UCSC (http://genome.ucsc.edu/cgibin/hgTables, Table: ensGene) in the BED format, and the transcript sequences were extracted from the reference E14 genome using the fastaFromBed utility from the BEDTools v2.17.0 suite [3] (http://code.google.com/p/bedtools/). Genome mappings corresponding to Ensembl annotated transcripts were extracted using custom Perl scripts, and converted to Ensembl transcriptome-based coordinates.

CIRS-seq analysis pipeline
First, SAM files with reads mapping for NT, DMS and CMCT conditions were sorted by transcript ID and position, then, using custom Perl scripts, the sum of reads mapping to each position was calculated. Since each CIRS-seq read gives information only on the base immediately preceding the first mapping position, which represents the reverse transcriptase stop point, we subtracted 1+ n (where n was the number of low-quality bases trimmed from reads 5'-end prior to mapping) to reads mapping positions to obtain the coordinates of the RT stop point. Reads mapping to position 1 of transcripts were discarded from analysis since they represent the necessary stop point of reverse transcription. For DMS and CMCT reactivity scores calculation we adjusted the approach previously used by Kertesz and collegues for their PARS score [4]. Briefly, DMS and CMCT reactivity scores were defined as the log 2 of the ratio between the normalized DMS (or CMCT) signal at a given position of the transcript, and the normalized signal in the NT sample at the same position.
To normalize for the different sequencing depth between the DMS and the CMCT conditions with respect to the NT condition, we defined the normalization constants k D and k C as follows: where n N , n D , and n C are respectively the total number of mapped reads in the NT, DMS, and CMCT experiments. Then, normalized signals for DMS, and CMCT samples at position i of a given transcript were calculated as: where n Di and n Ci are respectively the raw reads count at position i in the DMS and CMCT samples.
Since we treated the DMS and CMCT conditions independently, two independent normalizing constants k N_D and k N_C were calculated to respectively normalize the NT condition with the DMS, and the CMCT treated samples as follows: Similarly, the normalized signals for NT versus DMS, and NT versus CMCT samples at position i of a given transcript were calculated as: where n Ni is the raw reads count at position i in the NT sample.
DMS and CMCT scores at position i were then calculated as: and all negative reactivity values were brought to zero.
Final score for position i was calculated as: Scores greater than zero, theoretically, represent transcripts positions reactive to either DMS or CMCT treatment, and so increasing scores are directly proportional to an higher probability of observing such positions in singlestranded conformation.
To obtain normalized reactivity, we performed a 90% Winsorising to remove outliers, by setting each score greater than the 90 th percentile to the value of the 90 th percentile, and then dividing each value by the 90 th percentile to obtain a normalized reactivity ranging from 0 to 1. Positions with reactivities <0.3, 0.3-0.7, and >0.7 were considered respectively as weakly, moderately, and highly reactive.

Correlation between replicates
After calculating normalized reactivities for each transcript in the 2 biological replicates, Pearson correlation between replicates was calculated on the top (sliding offset: 5nt) across each transcript. For individual transcripts analysis, the size of the window was reduced to 3nt (sliding offset: 1nt).

Protein-coding transcripts secondary structures analysis
For the analysis of protein-coding transcripts, we first defined a set of highly dataset to obtain a per-transcript regional mean value, and we then averaged the regional mean values across the entire dataset. The significance for the observed differences was calculated using a two-sided paired Wilcoxon rank sum test.

Comparison of protein coding transcripts and ncRNAs
Transcripts were classified according to the Ensembl annotation (http://genome.ucsc.edu/cgi-bin/hgTables, Table: ensemblSource). To enable intertranscript comparison, we produced base-normalized reactivity scores. Since protein coding transcripts and ncRNAs differ in their GC% content, reactivities for A, C, G, and U residues were averaged independently. Mean reactivity scores for the 4 bases were then averaged to obtain the transcript's reactivity. To avoid biases due to different sequencing depths on different classes of transcripts, only the positions with coverage >50x were considered. Coverage per-base was calculated as the sum of the full length reads covering the base, and the RT-stops at the same position.

Lin28a binding sites analysis
For Lin28a binding sites analysis, the CLIP-Seq dataset GS37114 was downloaded from the Gene Expression Omnibus (GEO) database. Reads were clipped from adapter sequences, and mapped using Bowtie v1.0.0 to the same transcriptome reference used for CIRS-seq analysis. After peak-calling, summits were enlarged by 150 nt upstream and downstream, and average CIRS-seq reactivity was calculated. The significance for the observed differences was calculated using a two-sided Wilcoxon rank sum test.

Inference and representation of RNA secondary structures from reactivity scores
De novo secondary stuctures prediction was performed using RNAStructure v5.6 [5].       G  G  A  A  A  C  T  C  G  A  T  T   G   C  A  T  A  A  T  T  T  G  T  G  G  T  A  A  T   shown. Table S1. Sequence of primers used in this study.