Long non-coding RNAs display higher natural expression variation than protein-coding genes in healthy humans
© Kornienko et al. 2016
Received: 27 October 2015
Accepted: 6 January 2016
Published: 29 January 2016
Long non-coding RNAs (lncRNAs) are increasingly implicated as gene regulators and may ultimately be more numerous than protein-coding genes in the human genome. Despite large numbers of reported lncRNAs, reference annotations are likely incomplete due to their lower and tighter tissue-specific expression compared to mRNAs. An unexplored factor potentially confounding lncRNA identification is inter-individual expression variability. Here, we characterize lncRNA natural expression variability in human primary granulocytes.
We annotate granulocyte lncRNAs and mRNAs in RNA-seq data from 10 healthy individuals, identifying multiple lncRNAs absent from reference annotations, and use this to investigate three known features (higher tissue-specificity, lower expression, and reduced splicing efficiency) of lncRNAs relative to mRNAs. Expression variability was examined in seven individuals sampled three times at 1- or more than 1-month intervals. We show that lncRNAs display significantly more inter-individual expression variability compared to mRNAs. We confirm this finding in two independent human datasets by analyzing multiple tissues from the GTEx project and lymphoblastoid cell lines from the GEUVADIS project. Using the latter dataset we also show that including more human donors into the transcriptome annotation pipeline allows identification of an increasing number of lncRNAs, but minimally affects mRNA gene number.
A comprehensive annotation of lncRNAs is known to require an approach that is sensitive to low and tight tissue-specific expression. Here we show that increased inter-individual expression variability is an additional general lncRNA feature to consider when creating a comprehensive annotation of human lncRNAs or proposing their use as prognostic or disease markers.
Long non-protein coding RNAs (lncRNAs) have emerged as a fundamental new layer of genomic information in diverse species . They are considered to participate primarily in mRNA gene regulation [2–5] and to play roles in development and disease [6–8]. LncRNAs may be medically relevant as prognostic factors, disease markers, and drug targets [9–13]. To date, it is known that lncRNA genes are abundant in the genomes of human (, http://www.gencodegenes.org/stats.html), mouse ([15, 16], http://www.gencodegenes.org/mouse_stats.html), other vertebrates [17–20], plants , and simple model organisms such as C. elegans  and yeast [23, 24]. Although large numbers of lncRNAs have been identified, they have not yet been completely annotated in any organism. Human lncRNAs annotated by the GENCODE project comprise the largest public dataset containing 15,877 lncRNA genes (version 21: http://www.gencodegenes.org/stats/archive.html#a21). Many human annotation projects use cell lines , however, some also use primary tissues [14, 26]. An incomplete annotation may arise from two known features of lncRNAs - low abundance and tight tissue-specificity [14, 25]. Notably, lncRNA annotations differ not just between tissues, but also between closely related cell types [27, 28]. Thus, a comprehensive map of all lncRNA genes in the human genome would require systematic and deep analysis of all human body cell types. A recent attempt to define the human lncRNA landscape used several thousand normal and malignant samples and identified almost 47,000 new lncRNA genes , supporting earlier predictions that lncRNAs may outnumber protein-coding genes in human .
Relatively small numbers of mammalian lncRNAs have been assigned a function. A new functional lncRNA database lists only 181 human transcripts (http://www.lncrnadb.org/, ). While it is possible that some lncRNA transcription is a consequence of the local chromatin state [32–34], the gap between annotation and proven functionality reflects the considerable challenges in the analysis of non-coding compared to coding transcripts [35–39]. A deeper knowledge of lncRNAs as a transcript class has followed from genome-wide characterizations of their biology and genomic features with mRNAs as a reference point (reviewed in [30, 34, 40]). Both types of transcripts are transcribed by RNAPII, possess histone modifications typical of active or inactive genes and can be spliced, capped, and polyadenylated (reviewed in ). However, in addition to the basic lack of an open reading frame and functional translation , some studies have identified characteristics that differentiate lncRNAs from mRNAs. In comparison to mRNAs, lncRNAs are generally found to be more lowly-expressed, show higher tissue-specificity and be enriched in the nucleus [14, 25]. Many lncRNAs initiate from enhancer-like promoters that lack H3K4me3 histone modifications typical of standard mRNA promoters [28, 43], or from repetitive transposable elements normally absent from standard mRNA promoters . In terms of genome and biology features, lncRNAs are usually shorter with fewer exons and show inefficient co-transcriptional splicing  and reduced stability . They also show low sequence conservation and evolve faster than mRNAs [47–49].
One lncRNA feature not yet fully investigated in comparison to mRNAs that may influence identification and functional characterization is their natural expression variation. Protein-coding and lncRNA expression and transcript structure have been shown to be dependent on genetic variation in the human lymphoblastoid cell line (LCL) collection [50–52]. Analysis of protein-coding gene expression in whole human blood shows expression variation attributable to inter-individual (for example, age, BMI) and lifestyle (fasting status, smoking) differences, and technical issues such as sampling time, collection and preparation [53, 54]. In this study we use human primary granulocytes, a relatively pure cell type routinely obtained in clinics from healthy individuals and potentially useful diagnostically, to assess natural variability of lncRNA expression. We first prepared an RNA-seq dataset from 10 healthy individuals to define a human granulocyte transcriptome, not previously available. From this we annotated 6,249 lncRNA transcripts arising from 1,323 previously reported and 268 novel lncRNA loci. We show that examining granulocytes from multiple donors allows the identification of less well expressed, less efficiently spliced, and more granulocyte-specific lncRNAs. We then estimated lncRNA expression reproducibility and variability in granulocyte RNA-seq data from seven healthy individuals sampled in three replicates with approximately 1-month intervals. This inter- and intra-individual comparison demonstrated that although lncRNA expression is reproducible between replicates from the same individual, it is significantly more variable between individuals compared to mRNAs. Analysis of multiple tissues from the GTEx project  and lymphoblastoid cell lines from the GEUVADIS project  supports this conclusion and also shows that higher natural expression variability compared to protein-coding genes is a general feature of lncRNAs. Using the latter dataset we show that natural expression variability markedly influences lncRNA identification as the number of identified lncRNAs increases with the number of donors analyzed and does not reach saturation even with 120 donors. Together, the data show that high expression variability of lncRNAs is an important general feature, which not only additionally distinguishes them from mRNAs, but also will make it necessary to consider the number of individuals in strategies to comprehensively annotate and assign putative functions to lncRNAs in the human genome.
Defining the human granulocyte lncRNA transcriptome
Identification of new lncRNA loci and isoforms
We compared our granulocyte de novo lncRNA annotation to the most commonly used public annotations: GENCODE-v19 (23,898 lncRNA transcripts) , RefSeq (8,236 lncRNA transcripts) , and Cabili et al. (21,630 lncRNA transcripts)  and found that 46 % (736) of granulocyte de novo lncRNA loci were not present in public annotations, while 54 % (855) had a full or partial sense overlap with a publicly annotated lncRNA. Exon comparison with the three public annotations showed that we identified 5,694 new unique exons from 2,986 non-overlapping exonic regions. This shows that granulocytes have a specific lncRNA landscape that needs to be defined prior to granulocyte transcriptome analysis. To further assess the novelty of the annotated granulocyte de novo lncRNA loci we examined the MiTranscriptome lncRNA annotation based on 7,256 RNA-seq libraries from different human tissues, tumors, and cell lines . Together, this shows that while 83 % of the lncRNA loci identified in this study can be found in one of the four above lncRNA annotations, 268 (17 %) are not found (Fig. 1e). To test the reliability of our granulocyte de novo lncRNA annotation we first determined that over 80 % of transcripts were supported by at least one exonic overlap with a spliced EST (human ESTs, UCSC table browser) (Additional file 1: Figure S3A). Second, the MiTranscriptome lncRNA annotation  provided an additional validation as 78 % of our granulocyte de novo annotated lncRNAs were supported by an exonic overlap with a spliced MiTranscriptome lncRNA (Additional file 1: Figure S3B) with a median of 51 % exonic coverage of granulocyte de novo lncRNAs by MiTranscriptome lncRNAs (Additional file 1: Figure S3C). Public lncRNAs annotations had less overlap with our annotation (Additional file 1: Figure S3B) and showed poorer exonic coverage (Additional file 1: Figure S3C) and thus provided support for fewer of our granulocyte de novo lncRNA transcripts. In contrast, de novo mRNAs were well covered by public mRNA annotations and MiTranscriptome (Additional file 1: Figure S3B, D), indicating that the poor lncRNA coverage may arise from incomplete annotation of this transcript type in public annotations. Last, we used exon-spanning RT-PCR to test granulocyte de novo annotated lncRNA splice junctions (Additional file 2E). We confirmed 42 out of 46 tested junctions from 22 granulocyte lncRNA loci. We also cloned lncRNA transcripts from 18 granulocyte de novo lncRNA loci not present in public annotations, to confirm their full-length exon structure, continuity, and chromosome position (Additional file 1: Figures 1F, S4-S8 and Additional file 2F). Cloned sequences were deposited in GENBANK (Additional file 2G). In summary, we created a reliable lncRNA transcriptome annotation in healthy human granulocytes that identifies 1,591 lncRNA loci of which 17 % had not previously been described. Furthermore, we demonstrate that granulocyte de novo lncRNAs in contrast to mRNAs are incompletely represented in public annotations.
Non-mRNA-like features that may confound lncRNA annotation
To examine tissue-specificity we used publicly available RNA-seq data from 34 human cell types (ENCODE project (https://www.encodeproject.org), Illumina Human Body Map Project (http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513) (Additional file 2H). These data were aligned as in Fig. 1b and expression levels calculated for de novo annotated granulocyte transcripts. A transcript was considered granulocyte-specific if its expression in granulocytes was at least three-fold higher than in all other cell types. We found granulocyte-specific expression of 32.5 % (1,927) de novo annotated lncRNA transcripts and 4 % of de novo annotated mRNA transcripts (Fig. 2c, Additional file 1: Figure S9A). This trend was also observed for granulocyte-specific expression over the whole locus, indicating it is not an artifact of the greater number of mRNA isoforms in the de novo annotation (Additional file 1: Figures S9B and S10). The same analysis performed for GENCODE-v19 transcripts that are annotated from multiple sample types shows a decreased percentage of lncRNAs (9.0 %) and mRNAs (1.5 %) identified as granulocyte-specific, but a similarly large difference (six-fold) between the two transcript types (Additional file 1: Figure S9C). Analysis of tissue-specific expression performed separately for the three lncRNA transcript classes described above, shows that ‘in PA’ lncRNAs were more similar to GENCODE-v19 transcripts being depleted for granulocyte-specific transcripts compared to the bulk population (dashed green line, Fig. 2c), while ‘not in PA’ and ‘isoform not in PA’ transcripts showed equal or increased granulocyte-specificity.
Expression level is another feature strongly differentiating lncRNAs and mRNAs. We calculated RPKMs of granulocyte de novo lncRNA and mRNA transcripts in the PolyA+ data used for the de novo annotation, which showed that lncRNA transcripts are 10-fold less abundant than mRNAs (0.65/6.14, respectively; Fig. 2d). We noted that lncRNA/mRNA expression difference was slightly reduced (seven-fold median difference) when analyzing ribosomal-depleted datasets, indicating lncRNA under-representation in PolyA+ RNA (Additional file 1: Figure S11A). Comparing the three lncRNA transcript classes showed that ‘in PA’ transcripts display highest expression and ‘not in PA’ have lowest expression among the three classes in both PolyA+ (see inset, Fig. 2d) and ribosomal-depleted (Additional file 1: Figure S11F) data.
The third feature that may influence lncRNA identification is their reduced polyadenylation efficiency, as this would lower abundance in the PolyA+ fraction usually used for transcript identification. Given our above observation of poorer lncRNA representation in PolyA+ versus ribosome-depleted datasets, we compared transcript abundance in these granulocyte datasets to estimate the enrichment of lncRNAs and mRNAs in the PolyA+ fraction (Fig. 2e). While mRNAs showed a median 2.6-fold enrichment, lncRNAs showed a significantly lower median 1.6-fold enrichment (dashed green line, Fig. 2e). We tested if this difference was influenced by low lncRNA expression levels by splitting transcripts into expression bins (Additional file 1: Figure S12A). This showed that independently of absolute expression levels, lncRNAs show significantly lower PolyA+ enrichment compared to mRNAs. Comparing the three lncRNA transcript classes demonstrated that ‘not in PA’ and ‘isoform not in PA’ transcripts showed significantly lower PolyA+ enrichment than ‘in PA’ transcripts (Fig. 2e).
In addition to these four RNA biology features, we examined four genomic features. This showed that compared to mRNAs, lncRNAs transcripts have significantly fewer exons, their transcription starts are less CG-rich but more repeat-rich, and their exons contain more repeats (Additional file 1: Figures S11B-E and S12C). With the exception of the median exon number, these features were more extreme in ‘not in PA’ and ‘isoform not in PA’ lncRNAs than in the class of ‘in PA’ lncRNAs. Together this shows that new granulocyte lncRNAs identified in this study have less mRNA-like features that further distinguish them from mRNAs compared to the bulk lncRNA population. To support this claim we performed the same analysis for MiTranscriptome mRNAs and lncRNAs , which also shows that lncRNAs not in public annotations have less mRNA-like features (Additional file 1: Figures S14 and S15). Thus we show that features such as tight tissue-specificity and low expression, reduced enrichment in PolyA+ selected RNA and reduced splicing efficiency, not only distinguish lncRNAs from mRNAs, but by reducing their representation in the analyzed transcriptome make their identification more challenging.
LncRNAs are reproducibly expressed within one donor but vary between donors
We next investigated reproducibility of lncRNA expression in healthy individuals to assess if this could also influence the lncRNA discovery. To estimate expression reproducibility within or between donors, we examined expression in granulocytes from seven donors sampled at three time points spaced by at least 1 month (Fig. 3a, Additional file 2A). These 21 samples were subject to ribosome-depleted RNA-seq (Additional file 2B) aligned with STAR and expression levels were determined of all de novo annotated lncRNAs and mRNAs. We first tested if annotated transcripts were reproducibly expressed within one donor, that is, the three time points for each donor should show consistent lncRNA expression (RPKM >0.2) or absence (RPKM ≤0.2) of expression (Fig. 3b top). This analysis was performed separately for transcripts with different expression levels. Expression levels for each donor were calculated by averaging RPKMs from the three time points and a transcript was placed into a bin according to its maximal expression level among the seven donors. We counted the number of reproducibly-expressed transcripts and found that lncRNAs are less reproducible in bins 1 and 2, but above RPKM >2 almost all de novo annotated lncRNAs and mRNAs (99–100 %) were reproducibly detected within one donor. In contrast, comparing expression between the seven donors showed consistent lower expression reproducibility of lncRNAs compared to mRNAs (Fig. 3b, bottom). In the three highest expression bins, mRNAs showed 100 % reproducibility while lncRNA transcripts only reached 95 %. In summary, this qualitative analysis shows that, above an expression threshold (RPKM >2), lncRNAs are as reproducibly expressed as mRNAs within replicates from one donor. However, lncRNAs show less reproducible expression than mRNAs between different donors.
LncRNAs show high expression variability between donors
LncRNA expression is generally lower than that of mRNAs (Fig. 2d, Additional file 1: Figure S11A), which could bias the expression variability analysis, as lower expression will correlate with higher normalized standard deviation values. We controlled for this by distributing transcripts and loci into expression bins (Additional file 1: Figure S17). This showed that while variability anti-correlates with expression level for both lncRNAs and mRNAs, lncRNAs analyzed at the transcript or loci level show consistently more expression variability than mRNAs, independent of absolute expression level. We additionally plotted expression variability against mean expression between all donors for lncRNA and mRNA transcripts and loci (Additional file 1: Figure S18A, B). This showed that the trend lines of the anti-correlation between variability and expression level are clearly distinct for lncRNAs and mRNAs at both transcript and loci level, with lncRNAs displaying higher variability. Thus, high natural expression variability is not an artifact of the general low expression of lncRNAs. To identify the number of lncRNA and mRNA transcripts and loci significantly variable between donors we applied an ANOVA test (aov function in R ) to expression values in all the 21 (that is, seven donors sampled three times) ribosomal depleted RNA-seq samples. We find that 23.9 % (1,069) of lncRNA transcripts but only 4.2 % of mRNA transcripts are differentially expressed between the seven donors (transcripts RPKM >0.2, Fig. 4d). This trend persisted when applying an ANOVA test to expression over whole loci (Fig. 4d, 15.5 % and 4.4 % for lncRNA and mRNA loci, respectively). Importantly, this difference between lncRNAs and mRNAs was persistent when analyzing different expression bins (Additional file 1: Figure S19A). Figure 4e shows an example of a significantly variable lncRNA expressed from chromosome 15. Among the four displayed tracks donors 3 and 4 show higher expression, consistent among three replicates, while donors 1 and 2 show low expression consistent among replicates. Since 25 % of de novo annotated lncRNAs are bidirectional and likely share a promoter with an mRNA (Fig. 1c), we examined if this class resemble mRNAs in having less expression variability. Figure 4f shows that bidirectional lncRNA transcripts more closely resembled mRNAs and were significantly less variable than antisense or intergenic lncRNAs and this trend was also observed in all expression bins and over the whole locus (Additional file 1: Figure S20A-C).
Publicly annotated lncRNAs show less expression variability
To further confirm high lncRNA expression variability and to investigate its impact on lncRNA identification, we analyzed expression variability of publicly annotated (Additional file 1: Figure S21A, B) and of MiTranscriptome (Additional file 1: Figure S22A, B) lncRNAs and mRNAs in our granulocyte RNA-seq data. All annotations confirmed high lncRNA expression variability compared to mRNAs. However, the extent of the lncRNA/mRNA difference was reduced when analyzing public annotations compared to the MiTranscriptome annotation and our de novo granulocyte annotation, which both identified numerous novel lncRNAs. We then estimated expression variability separately for the three lncRNA classes described in Fig. 2a, and found that transcript types ‘not in PA’ and ‘isoform not in PA’ showed significantly higher variability between the seven donors, compared to ‘in PA’ transcripts (Fig. 4g) and this trend was observed in all expression bins (Additional file 1: Figure S23A) and also when analyzing expression over whole locus for ‘new’ and ‘known’ lncRNA loci (Additional file 1: Figure S23B, C). To test this further, we analyzed expression variability of MiTranscriptome lncRNAs classified according to their presence in public annotations (as described in Additional file 1: Figure S14D). This showed that ‘not in PA’ and ‘isoform not in PA’ MiTranscriptome lncRNAs displayed higher expression variability (Additional file 1: Figure S22C), consistent with results for the de novo granulocyte lncRNA annotation. Together this supports our arguments above, that lncRNAs not in public annotations have less mRNA-like features.
A list of robustly or variably expressed lncRNAs in human primary granulocytes
Following the discovery of high intra- and inter-individual expression variability of lncRNAs we sought to generate a list of robustly expressed and variably expressed granulocyte lncRNAs as a resource. To generate the robustly expressed list we filtered 6,249 lncRNA transcripts in our annotation (that is, the set of transcripts that ‘can be’ expressed in granulocytes) to identify those detected (RPKM >0.2) in all replicate samples from seven donors. This gave a robustly expressed annotation of 2,490 transcripts from 393 lncRNA loci (Additional file 7A). We applied stricter criteria and required a higher level of expression (RPKM >1) in every sample to produce another annotation of ‘well-expressed robust’ lncRNAs in granulocytes with 817 transcripts from 115 lncRNA loci (Additional file 7B). A list of significantly variably expressed (defined as in Fig. 4d) lncRNAs with 1,069 transcripts from 214 lncRNA loci is provided in Additional file 8.
LncRNAs expression variability in lymphoblastoid cell lines (LCL)
We used the LCL de novo annotation to calculate the RPKM of every transcript and locus in all 462 donors. An ANOVA test could not be applied due to the absence of donor replicates, but inter-individual variability was calculated from the normalized standard deviation of expression between all donors. Comparing lncRNAs to mRNAs showed that lncRNAs are significantly more variable both when calculating expression of transcripts or over whole loci (Fig. 5d, e). We controlled for expression level by distributing transcripts or loci to expression bins as described above and found that except for bin1 transcripts, lncRNAs were significantly more variable in expression than mRNAs (Additional file 1: Figure S27). To complete the comparison with the granulocyte data, we found LCL bidirectional lncRNAs to be significantly less variable than intergenic lncRNAs in all expression bins (Additional file 1: Figure S28). In addition, LCL de novo lncRNAs not covered by public annotations (‘not in PA’ transcripts) show significantly more expression variation than publicly annotated transcripts (Fig. 5f, Additional file 1: Figure S29). This analysis of an independent cell type with an independent sample collection and processing method from a larger number of donors supports our finding of high inter-individual lncRNA expression variability.
LncRNA expression variability is increased in multiple human tissues
Increased expression variability affects lncRNA identification
We tested this by de novo annotating lncRNAs and mRNAs from a variable number of individuals. We picked 120 GEUVADIS LCL donors (Fig. 7b, Additional file 11A), unified the data by sampling 25 million paired-end reads from each donor and created 30 pools, each with four donors (two male and two female donors) with a total of 100 (25 × 4) million reads. From the 30 pools we created 30 LCL de novo transcriptome assemblies using Cufflinks. We randomly picked 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 25 (using three replicates of random picking), and 30 assemblies, which corresponded to 4, 8, 12, 16, 20, 24, 32, 40, 60, 80, 100, and 120 donors, respectively, and applied Cuffmerge and the de novo transcriptome annotation pipeline to each group of assemblies (Additional file 1: Figure S31A, B and Additional file 11B). Only one pool (100 million reads) was fed at a time into the assembly pipeline, thus the sensitivity of Cufflinks was unchanged. In addition, assemblies but not reads were merged at this stage. Different number of assemblies fed into our annotation pipeline produced multiple lncRNA/mRNA annotations with different numbers of loci and transcripts. We plotted the number of mRNA and lncRNA loci (averaged from the three replicates described above) versus the number of donors used (Fig. 7c, Additional file 11C). This showed that while lncRNA loci number (green lines) grew three-fold with increasing donor number, from 1,382 loci obtained from four donors to 4,166 obtained from 120 donors, the number of mRNA loci (blue lines) shows a much lower level of increase from 12,085 (four donors) to 12,857 loci (120 donors). This supports the hypothesis that adding more individuals to the identification pipeline increases the number of lncRNA loci but not the number of mRNA loci.
In contrast to the loci analysis, the number of transcript isoforms increased with similar kinetics for both lncRNAs and mRNAs (approximately seven-fold increase from four to 120 donors; Additional file 1: Figure S31C). The difference between lncRNAs and mRNAs is that an increasing donor number allows identification of an increasing number of transcript isoforms inside a stable number of mRNA loci, while lncRNAs retain a low median number of transcripts per locus and increase the number of loci annotated in the genome (Additional file 1: Figure S31D). Note that we did not expect to find non-annotated mRNAs loci since the mRNA de novo identification pipeline was limited to annotated mRNA genes. If the analysis did identify non-annotated mRNA loci they would be recognized among lncRNA candidates that were filtered by the pipeline step that estimated transcript protein-coding capability. However, this step only removed a low-level increase from 83 (four donors) to 198 (120 donors) loci (Additional file 1: Figure S31E). The slight increase in mRNA loci number with increasing donor number (Fig. 7c) likely arises from high inter-individual expression variability of a small number of mRNAs in LCLs. The larger increase in lncRNA loci number also arises from identifying more highly variable annotated lncRNAs when analyzing more donors, but also potentially by identifying novel lncRNA loci.
Assembling transcriptomes from pools of 100 million paired-end read does not increase Cufflinks sensitivity (Additional file 1: Figure S31A), but including more donors into the identification pipeline naturally increased the number of transcriptome assemblies merged and therefore the total amount of the RNA-seq data analyzed (from 1 to 30 × 108 sequencing reads). To control that this strategy did not only lead to the identification of marginally-expressed lncRNAs we plotted the RPKM of lncRNAs added to annotation with the addition of more donors (Additional file 1: Figure S32). This shows that median level of expression (in at least one donor used for identification) of newly-identified lncRNAs is RPKM of approximately 1, which means that 50 % of the newly-identified lncRNA transcripts are well-expressed (RPKM >1). This median level also does not decrease for transcripts that are only found with large donor numbers. In addition, we analyzed the dynamics of lncRNA identification with increasing the donor number in different expression bins (Additional file 1: Figure S33). This shows that lncRNAs from high-expression bins contribute substantially to the overall increase in lncRNA loci and transcript number. For example, four donor annotations identified 314 lncRNA transcripts initiating from 152 different loci in bin4 (that is, at least one donor used for identification expresses the transcript with 4 < RPKM < 8), while annotating from 120 donors identified 3,518 bin4-lncRNA transcripts initiating from 610 loci. Thus, while marginally-expressed lncRNAs are identified by adding more donors to the analysis, they only constitute a fraction of the newly-identified transcripts. Both controls show that identification of an increasing number of lncRNAs cannot be solely attributed to stochastic sampling sensitivity and identification of lowly-expressed transcripts, but likely arises from genuine expression variability between individuals.
We next asked if the lncRNA loci identified with increased donor numbers were new or known loci (as defined in Fig. 2a) and what were the dynamics of their identification. To do this we plotted the normalized number (the number of loci at 120 donors set to 100 %) of known (dark gray) and new (light gray) lncRNA loci versus donor number (Figure S34 in Additional file 1: Figure S34 and Additional file 11C). For comparison the same plot shows the dynamics of mRNA (dashed blue line) and all lncRNA (dashed green line) identification from the data in Fig. 7b. This shows that although the number of known lncRNA loci increases with donor number from 948.5 (four donors,) to 2,103 (120 donors), the number of novel lncRNA loci shows a more striking increase from 433.7 to 2063 loci (2.2-fold and 4.8-fold, respectively; Additional file 1: Figure S34) (note that non-integer loci numbers arise from averaging three replicates). While mRNA loci identification plateaued with four donors, the known lncRNA loci identification curve starts to plateau with >80 donors, but the new lncRNA identification curve does not plateau up to 120 donors.
Finally, we used the most comprehensive de novo annotation from 120 donors as a reference transcriptome to build a ‘donor saturation curve’ to test how well this annotation can be recreated using fewer individuals. We counted the number of reference 120 donor lncRNA and mRNA loci identified (defined by >50 % coverage, Additional file 1: Figure S35A, top, Additional file 11D) using a reduced number of donors. The resulting curve saturates for mRNAs, but does not saturate for lncRNAs even with 120 individuals. Only 27 % of lncRNA loci identified with 120 donors were identified using four donors, this increased to 50 % at 20 donors and thereafter continuing to rise. The difference between known and new lncRNA loci was consistent with observations in Additional file 1: Figure S34. We also assessed how well the exon structure of mRNAs and lncRNAs from the reference 120 donor annotation was recreated by annotations obtained using fewer donors (Additional file 1: Figure S35B). Median exonic coverage of mRNAs was above 90 % just using four donors, whereas lncRNAs require 80 donors to reach similar levels of exonic coverage. In summary, these analyses show that increasing the donor number will identify more lncRNA loci, however, the donor number required is vastly in excess of that required for mRNAs.
An appreciation of the need to define the lncRNA landscape of the whole human genome is increasing with the number of known lncRNAs genes and with an understanding of the unique qualities of their biology. Although the GENCODE annotation comprises the largest public dataset with 15,877 lncRNA genes (version 21: http://www.gencodegenes.org/stats/archive.html#a21), later studies that used several thousand normal and malignant samples from numerous individuals identified four-fold more lncRNA genes . Why the number of lncRNAs continues to rise apparently in excess of protein-coding gene number, is not yet clear. In this study we set out to annotate the lncRNA transcriptome of freshly harvested human granulocytes with the goal of investigating lncRNA inter-individual expression variability and determining how this influences lncRNA identification.
The resulting human granulocyte transcriptome obtained from 10 healthy individuals identified 1,591 lncRNA loci with a mean of 3.9 transcripts per locus. The same granulocytes express approximately six-fold more mRNA loci each with approximately three-fold more transcripts. The reduced activity of lncRNA loci relative to protein-coding loci has been noted [14, 62]. Comparing the granulocyte de novo annotation to the most commonly used public annotations (GENCODEv19: 23,898 lncRNA transcripts , RefSeq: 8,236 lncRNA transcripts , Cabili: 21,630 lncRNA transcripts ) that together contain 19,762 non-overlapping lncRNA loci, shows that one-third of granulocyte de novo lncRNA transcripts are not present and one-third added a new isoform to public-annotated loci. A comparison with the recent massive MiTranscriptome lncRNA annotation containing 46,331 new lncRNA loci , showed that 268 granulocyte lncRNA loci (17 % of the annotated granulocyte lncRNA transcriptome) were not previously reported. With the caveat that different annotation pipelines may influence identification, this shows that human granulocytes have a specific lncRNA landscape that needs to be defined prior to transcriptome analysis, rather than relying on integrative lncRNA landscapes from multiple cell types.
The identification of numerous new human granulocyte lncRNA loci is surprising in view of the extremely large numbers present in public annotations or datasets. Because of this we investigated if specific lncRNA biology features contribute to their under-representation in public databases by assessing if they were more prominent in new loci or isoforms. We first investigated four known features, that is, very tight tissue-specific expression, lower expression level, inefficient enrichment in PolyA+ selected fractions, and inefficient splicing (reviewed in [30, 34, 40]). In each case we demonstrated a significant difference for these features between lncRNAs and mRNAs and, in addition, demonstrated that these features are more prominent in new lncRNA loci and transcript isoforms. For example, reports from different species show that lncRNAs compared to mRNAs have tight tissue-specific expression and also are generally more lowly expressed [14, 15, 17, 18, 25, 65]. We found that while only 4 % of mRNA transcripts display granulocyte-specific expression, 32 % of lncRNA transcripts, and 57 % of novel lncRNA transcripts were granulocyte-specific. Similarly, lncRNA expression levels were 10-fold less abundant than mRNAs, as reported in many species (see above references), however expression of novel ‘not in PA’ lncRNA transcripts was 13-fold less abundant. We could also show that lncRNA enrichment in the PolyA+ fraction relative to total ribosomal-depleted fraction was reduced compared to mRNAs (respective median enrichments of 1.6-fold and 2.6-fold) in agreement with findings that a proportion of lncRNAs are not polyadenylated  and that this reduction was 1.6-fold greater for novel ‘not in PA’ lncRNA transcripts. A relatively new feature reported for imprinted cis-repressor lncRNAs such as Airn and Ube3a-ats [67, 68] and for some lncRNAs in human K562 cells  that could also affect the abundance of full-length transcripts in PolyA+ RNA fractions, is inefficient splicing. We accessed splicing efficiency of lncRNAs and mRNAs in our granulocyte data and showed that compared to mRNAs, lncRNAs are less efficiently spliced with a broad distribution of splicing efficiency. Median lncRNA splicing efficiency was reduced by 10.9 % compared to mRNAs, however, novel lncRNA transcripts showed 22.9 % reduction. We confirmed the inefficient splicing of lncRNAs and the greater reduction in novel lncRNA using the independent MiTranscriptome annotation . Together this analysis shows not only that lncRNAs share several non-mRNA-like biology features, but also that these features are more prominent in new lncRNA loci and transcript isoforms and thus are likely to reduce lncRNA representation in public annotations.
The last feature examined that could influence the incomplete representation of lncRNAs in public databases is that of natural expression variation. We used the granulocyte annotation with seven donors sampled at three time points separated by at least 1 month, to estimate the natural expression variability of lncRNAs relative to mRNAs. This analysis shows that lncRNA expression is unexpectedly highly variable among a population and, while relatively stable over time within an individual, lncRNA expression variation is significantly larger than that of mRNAs independent of expression level. We find that when considering all the 6,249 de novo annotated granulocyte lncRNA transcripts only 40 % (2,490) are robustly expressed, while 17 % (1,069) display significant inter-individual variable expression even within the small sample size of seven donors. Importantly, we show that high natural expression variability is not a consequence of the generally low expression of lncRNAs, as lncRNA transcripts/loci in all expression bins were more variable than mRNAs and also displayed higher percentage of significant inter-individual variable expression assessed by ANOVA test. The high inter-individual variability of lncRNA expression was unique enough to allow unsupervised grouping of replicates sampled over several months according to each of the seven donors. We verified high lncRNA inter-individual expression variability by demonstrating a similar difference for MiTranscriptome annotated transcripts expressed in granulocytes. We also analyzed an independent public RNA-seq lymphoblastoid cell dataset from GEUVADIS . This LCL dataset derived from 462 donors displayed an overall higher median expression variability for both mRNAs and lncRNAs than the granulocyte dataset consisting of seven donors; however, the relative two-fold difference between lncRNAs and mRNAs loci and transcripts was similar. In each of the three above analyses we could show that novel lncRNA transcripts display higher expression variability than known lncRNA transcripts. Lastly, we demonstrated that high lncRNA inter-individual expression variability relative to mRNAs is a general phenomenon in human tissues, by analyzing multiple tissues from the GTEx project . Interestingly, although we analyzed the same number of donors per tissue we found different absolute levels of lncRNA and mRNA expression variability, with skeletal muscle displaying the highest and LCL, nerve, and thyroid displaying the lowest variability level. As an important control, analyzing LCL in the GTEx dataset using the MiTranscriptome annotation showed similar levels of expression variability as that obtained by analyzing the GEUVADIS LCL dataset using our de novo LCL annotation. Overall, these expression variability analyses of public datasets, in additional to our granulocyte analysis presented here, confirm our conclusions and support the general nature of increased lncRNA natural expression variability compared to mRNAs.
Comparison of lncRNA and mRNA expression variability was performed as a small part of two previous studies. One LCL study analyzing splicing variability of protein-coding genes found a small number (183) of GENCODE lncRNAs with consistent higher expression variability than mRNAs, even in the absence of replicates . The second study  reported a similar relative impact of inter-tissue and inter-individual variability to total variance in gene expression for highly expressed (median RPKM >2.5 among 1,641 analyzed samples comprising 43 body sites from 175 individuals) GENCODE-v12 lncRNAs and mRNAs. This implies, given the known increased inter-tissue variability of lncRNAs, that inter-individual variability of lncRNAs is also greater in its absolute value than that of mRNAs. This study additionally reported enrichment of lncRNAs among genes showing differential expression between individuals of different populations. Thus, the findings from both these studies are consistent with our demonstration here of higher natural expression variation of lncRNAs compared to mRNAs.
High lncRNA inter-individual expression variability highlights another striking biology feature that distinguishes lncRNAs from mRNAs. The finding that expression variability is more prominent in new lncRNA loci and reduced in reference lncRNA annotations also indicates it can influence identification. Thus public annotations based on limited numbers of human donors or derived from single animal or plant inbred strains, may have reduced representation of variably expressed lncRNAs. We demonstrate this with the GEUVADIS LCL RNA-seq data derived from one cell type, by showing that adding more donors to the analysis identifies more lncRNA genes in the human genome. The number of lncRNA loci increased continuously, with novel lncRNA showing a more striking increase than known lncRNAs. The MiTranscriptome study that used a donor number per tissue comparable to our LCL analysis  identified three-fold more novel lncRNAs than present in the three commonly used public databases (see above references). Our results also indicate that a granulocyte lncRNA annotation based on 10 donors, is most likely at the lower part of the donor saturation curve for this cell type. Moreover, our finding that the identification of novel lncRNA loci does not plateau even with 120 donors indicates that comprehensive annotation of lncRNAs in the human genome requires as many individuals as possible. The identification of high lncRNA intra- and inter-individual expression variability has implications for identifying lncRNAs and assessing their function and potential medical use. LncRNAs that lack consistent expression in some individuals are unlikely to be necessary for normal cell function, but may be functional in an age, environment, lifestyle, or disease related manner as shown for some protein-coding genes [54, 70]. At the same time, it cannot be assumed that a robustly expressed lncRNA has an important function in the cell type in which it is expressed. For example, the developmentally important Airn lncRNA retains robust expression after performing its silencing function . Our results support the view that functional studies require an understanding of basic lncRNA biology in different individuals before they can be interpreted [36, 72].
The basis of increased inter-individual expression variation of lncRNAs relative to mRNAs is unknown. It may be relevant that, together with a lower conservation and faster evolution rate, human lncRNAs are recently evolved loci, harboring more SNPs than protein-coding genes [49, 73]. LncRNAs may also be more susceptible to environmental and lifestyle factors that contribute to mRNA expression variation . Studies of protein-coding genes and lncRNAs in LCLs prepared from different population groups conclude that both expression strength and alternative splicing contribute to expression variability [50, 69, 74, 75]. How this contributes to differences in lncRNA and mRNA expression variability is not known. Bidirectional lncRNAs that likely share a promoter with a neighboring protein-coding gene are regulated similarly to neighboring protein-coding genes  and we show that compared to intergenic or antisense lncRNAs, expression variability of bidirectional lncRNAs is more similar but still greater, than that of mRNAs. Inter-individual alternative splicing may contribute as some lncRNA loci display unusually high alternative splicing and variable exon structures . However, this is not supported by our observation that expression variation over the whole locus is similar to that of transcript isoforms. LncRNA genes are considered to be similar to mRNA genes as both are transcribed by RNAPII (reviewed in [30, 34]). However, details of their promoters or enhancers that could explain the five non-mRNA-like features highlighted here (tight tissue-specificity, low expression, inefficient PolyA+ selection, inefficient splicing, and high inter-individual expression variation) have not yet been investigated. Some potential gene regulatory features (chromatin-modification patterns, splicing signals) are similar for lncRNAs and mRNAs [14, 18, 25, 78]. Some publications identified non-mRNA-like features in lncRNAs while others stress mRNA-like features, particularly of intergenic lncRNAs [15, 46, 79–81] (reviewed in [30, 34, 40]). The analysis of healthy granulocytes presented here supports the view that a lncRNA subpopulation shows distinct non-mRNA-like features, which now includes high inter- and intra-individual expression variability. Non-mRNA-like features of lncRNAs may have use in their classification, as it is likely to be relevant for their function [82, 83]. We show here that in healthy granulocytes only 40 % (2,490) of lncRNA transcripts are robustly expressed, while 17 % (1,069) of lncRNA transcripts show significant variable expression. The biological significance of robust or variable expression is not yet clear and both classes of lncRNAs may be useful for some studies. However, explanations of lncRNAs in terms of their evolution and function or proposals of their use as biomarkers or therapeutic targets first require an understanding of the robustness of their expression in healthy tissues.
We demonstrate here by analysis of human granulocyte RNA-seq data from multiple individuals that lncRNAs show unusually high natural expression variability compared to mRNAs. We use this dataset to generate a list of robustly and variably expressed granulocyte lncRNAs that will be of use in future applications. We also show that higher expression variability of lncRNAs is a general phenomenon inherent to diverse human tissues and cell lines that is of yet, unknown biological significance. High natural expression variability of lncRNAs, in addition to their tight tissue-specificity, low expression, inefficient PolyA+ selection, and inefficient splicing, identifies a set of five non-mRNA-like features that distinguish part of the lncRNA population from mRNAs and, also reduces their representation in reference annotations. We show that high inter-individual expression variability offers one explanation for the incomplete annotation of lncRNAs in many genomes. Our analysis shows that increasing the number of individuals analyzed will identify more lncRNA loci in the human genome, however, the donor number required is vastly in excess of that required for mRNAs. The finding of high expression variability of lncRNAs and its effect on identification provides novel guidelines for lncRNA annotation and additional considerations for design of functional studies and personalized medicine approaches.
Sample collection from healthy donors
Ten volunteers (five men, five women; age range: 27–62 years) without obvious disease were recruited to donate blood. Seven volunteers donated blood three times with gaps of 5 to 21 weeks (Additional file 2A). The remainder donated once only. Donors abstained from eating on the morning of donation; 45 mL of venous blood was collected between 10:00 and 11:00 into VACUETTE® Sodium Citrate Coagulation Tubes and processed immediately. Granulocytes were isolated using density gradient centrifugation and immediately used for RNA preparation either depleted for ribosomal RNA using the RiboZero rRNA removal kit Human/Mouse/Rat (Epicentre) or a polyA enriched using the TruSeq RNA Sample Prep Kit v2 (Illumina) (details in Additional file 1: Supplemental Methods).
RNA-seq library preparation and read alignment
(a) Non-strand-specific libraries were prepared using the TruSeq RNA Sample Prep Kit v2 (Illumina) following the manufacturer’s protocol. (b) Strand-specific library preparation used same kit with modifications . Equal concentrations of barcoded libraries were pooled for 50 bp or 100 bp paired-end sequencing by Illumina HiSeq 2000 (Biomedical Sequencing Facility http://biomedical-sequencing.at/). After base-calling and sample de-multiplexing, the RNA-seq data were provided as archived .fastq or unmapped .bam files. RNA sequencing reads were aligned using STAR aligner with adjusted default parameters  (details in Additional file 1: Supplemental Methods).
RNA-seq read number
Three stranded samples were sequenced per flow cell lane generating 22 to 79 million 100 bp PE reads per sample. Unstranded PolyA+ RNA-seq samples varied from 24 to 38 million 100 bp PE and 64 to 91 million 50 bp PE reads. In total we obtained 17 PolyA+ RNA-seq datasets and 21 total RNA-seq dataset totaling 2.13 billion reads (Additional file 2B).
Annotating mRNAs and lncRNAs in primary granulocytes
A total of 784 million PolyA+ RNA-seq reads from 10 donors were used to de novo annotate lncRNA and mRNA transcriptomes in granulocytes (see details in Additional file 1: Supplemental Methods). The final de novo annotation of human primary granulocytes was 132,864 mRNAs forming 10,092 genomic loci (average 13.2 transcripts per locus) and 6,249 lncRNAs forming 1,591 genomic loci (average 3.9 transcripts per locus). Assembly quality was assessed by inspecting de novo annotation of well-known lncRNAs like XIST (Additional file 1: Figure S2A) and by analyzing completeness of assembly of RefSeq (Additional file 1: Figure S2B) and GENCODE-v19 (Additional file 1: Figure S2C) annotated mRNAs.
Positional classification of lncRNAs
lncRNA loci and transcripts were divided into three classes based on their relative position to protein-coding genes. We combined de novo mRNA annotation with public protein-coding gene annotations by GENCODE-v19 and RefSeq to obtain the most comprehensive annotation of protein-coding genes in granulocytes. We then called lncRNA loci/transcripts bidirectional if they shared or overlapped a promoter (defined as TSS +/- 1.5 kb) with a protein-coding gene. LncRNA loci/transcripts overlapping a protein-coding gene in the antisense direction were called ‘antisense’ (sense direction overlaps were removed from the annotation). The third position-based class ‘intergenic’, had no overlap with a protein-coding gene.
Cloning of full-length lncRNA transcripts
RT-PCR was performed on granulocyte cDNA to amplify full-length lncRNA transcripts prior to cloning. PCR primers (http://biotools.umassmed.edu/bioapps/primer3_www.cgi) spanned the transcript from first to the last exon and the PCR product length limited to 1.5 kb (Additional file 2F). Isolated plasmid DNA was Sanger Sequenced and aligned to the human genome using BLAT. Cloned sequences are displayed as a UCSC screen shot with the de novo lncRNA annotation, primers, and BLAT alignment (Additional file 1: Figures S4-S8). Seventy-five cloned sequences were submitted to GENBANK (Additional file 2G).
Public RNA-seq data mining
We downloaded publicly available raw strand-specific RNA-seq data (fastq files) from various cell types/tissues produced by the ENCODE project and Illumina Human Body Map Project (see list in Additional file 2H), processed it as for other sequencing data in the study (see: RNA-seq read alignment).
This was calculated using RPKM_count.py (RSEQC package). Expression of a transcript is the RPKM of exons of a one transcript, expression over a locus is RPKM of the whole locus including intronic signal.
Splicing efficiency analysis
We estimated splicing efficiency for each splice site of each multiexonic transcript in our ribosomal-depleted granulocyte RNA-seq from seven donors with three time points pooled at the alignment stage to increase coverage. Splicing efficiency of each splice site was calculated separately in each donor. We calculated RPKM of the exonic and intronic boundaries of the splice site (45 bp each, leaving out 5 bp directly at the splice site to allow for imprecision of splice site identification), calculated the ratio of intronic to exonic signal, and by that estimated how efficiently this splice site was used (Additional file 1: Figure S13A). A splice site was discarded if exonic RPKM was below the cutoff (RPKM = 0.2) in any of the seven donors. We then introduced a value ‘Splicing efficiency’ (S), ranging from 0 for completely unused splice sites (intronic signal equal or higher than exonic signal) to 100 for optimally used spliced splice sites (no intronic signal detected). S = 100*(1-RPKMintronic/RPKMexonic). We replaced all the negative S values (when intronic signal was higher than exonic signal) with 0, defining such cases as full absence of splicing. We averaged the splicing efficiency value calculated from seven donors for each splice site. Splicing efficiency of a transcript was then defined as the maximal splicing efficiency achieved by the most efficiently spliced site of that transcript. Splicing efficiency of a locus was similarly defined by the maximal splicing efficiency among all transcripts (all splice sites) in the locus.
Assigning P value to boxplot comparisons
Every boxplot was plotted using values for all the transcripts/loci analyzed (number of transcripts/loci indicated in the boxplot). The difference in population sizes of compared transcript/loci types was accounted for by performing statistical tests on equalized population sizes. Namely, the larger population was randomly subsampled to match the size of smaller population and Mann–Whitney U test was applied to estimate significance of the difference between the populations with equalized sizes. Subsampling and statistical tests were performed three times for each comparison and the three P values obtained were averaged to give the resulting P value to be indicated on the boxplot.
Inter-individual expression variability analysis
Inter-individual expression variability was estimated by calculating standard deviation of expression between analyzed donors then normalizing it to the mean expression of the locus/transcript among all analyzed donors. For granulocytes we assessed variability between seven donors (expression of a locus/transcript in each donor was calculated as a mean of expression of the three time points of this donor). For LCL we assessed variability between 462 donors.
GEUVADIS project RNA-seq data analysis
We downloaded and aligned using a common pipeline all 462 PolyA+ 75 bp paired end RNA-seq raw sequencing datasets provided by GEUVADIS RNA-seq project (http://www.ebi.ac.uk/ena/data/view/ERR188021-ERR188482). The data contained donors from five populations (http://www.1000genomes.org/category/frequently-asked-questions/population). We picked two female and two male unrelated donors from each population and used RNA-seq from these 20 donors to assemble the LCL de novo lncRNA and mRNA transcriptome. We pooled the samples into five groups with a similar number of aligned spliced reads (Additional file 2I) and performed transcriptome assembly following the pipeline described for granulocytes. As the RNA-seq datasets were not strand-specific we used strand-specific PolyA+ RNA-seq of GM12878 from the ENCODE project (Additional file 2H) in the pipeline where needed. Quality assembly (Additional file 1: Figure S24B) was assessed as for granulocytes.
GTEx RNA-seq data analysis
Aligned (as described in ) RNA-seq data from the GTEx project (http://www.gtexportal.org/home/) were downloaded from dbGaP (https://dbgap.ncbi.nlm.nih.gov/) as described in (http://www.gtexportal.org/static/misc/GTEx_Poster_CommunityMeeting_TY.pdf) after we applied and were granted data access. We downloaded RNA-seq data for nine tissues (namely lymphoblastoid cell line (LCL), adipose, artery, cerebellum, heart, lung, muscle, nerve, and thyroid), from 10 male and 10 female individuals each (Additional file 2J). The aligned RNA-seq datasets were unstranded and ranged from 14.8 to 85.4 (average 52.1) million paired-end reads each. We calculated RPKM of MiTranscriptome annotated multi-exonic lncRNAs and mRNAs in all samples and performed variability analysis between 20 individuals per tissue.
Donor saturation curve
One hundred and twenty out of 462 GEUVADIS RNA-seq samples containing more than 25 million reads were picked for the analysis from 12 unrelated women and men from each of the five population groups. A total of 25 million reads were randomly sampled from each RNA-seq sample using DownsampleSam.jar (Picard tools http://broadinstitute.github.io/picard/command-line-overview.html#DownsampleSam). Donors were grouped into 30 groups each with two women plus two men from the same population and the reads from the four donors were pooled using MergeSamFiles.jar (Picard tools http://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles) to produce 30 × 100 million read pools. Cufflinks was used to assemble a transcriptome from each pool (Additional file 1: Supplemental Methods) resulting in 30 transcriptome assemblies. Of these 30 assemblies, 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 25, or 30 assemblies were used to annotate de novo LCL transcriptomes from different number of donors (4, 8, 12, 16, 20, 24, 32, 40, 60, 80, 100, and 120, respectively) and to define the relation between the number of loci (Y axis) and the number of donors/assemblies (X axis). We randomly picked the needed number of assemblies from the list of 30. The random picking was performed three times for each number of assemblies (Additional file 1: Figure S31B), except when all 30 assemblies were used for the last point. The picked assemblies were then merged with Cuffmerge and underwent the previously established de novo annotation pipeline (Additional file 1: Supplemental Methods).
Peripheral blood samples were collected from healthy volunteers after written informed consent at the Vienna General Hospital (Allgemeines Krankenhaus der Stadt Wien, Klinische Abteilung für Hämatologie und Hämostaseologie). The study was approved by the local Ethics committee of the Medical University of Vienna (‘Ethik Kommission der Medizinischen Universität Wien’) and experimental methods comply with the Helsinki Declaration.
Availability of data
Raw granulocyte RNA-seq data, RPKM, and variability values for granulocyte de novo lncRNAs and mRNAs as well as their BED12 annotation files were deposited in NCBI's Gene Expression Omnibus  and are accessible through GEO Series accession number GSE70390 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70390). LncRNA annotations in granulocytes and LCL created in the study are available to directly download as Additional files in bed12 format. Genbank accession numbers for sequenced lncRNAs are listed in Additional file 2G.
We thank Ruth Klement, Tomasz Kulinski, Elisangela Valente, Elisabeth Salzer, and Roland Jäger for technical/bioinformatic assistance and advice, the CeMM IT department and José Manuel Molero for help and advice on software usage, the Biomedical Sequencing Facility (http://biomedical-sequencing.at/) for sequencing and advice, Jacques Colinge, Daniel Andergassen, and Tomasz Kulinski for discussions, Quanah Hudson and Jörg Menche for reading and commenting on the manuscript.
This study was partly funded by the Austrian Science Fund (FWF F43-B09, FWF W1207-B09). PMG is a recipient of a DOC Fellowship of the Austrian Academy of Sciences.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Morris KV, Mattick JS. The rise of regulatory RNA. Nat Rev Genet. 2014;15(6):423–37.PubMedPubMed CentralView ArticleGoogle Scholar
- Bonasio R, Shiekhattar R. Regulation of transcription by long noncoding RNAs. Annu Rev Genet. 2014;48:433–55.PubMedPubMed CentralView ArticleGoogle Scholar
- Geisler S, Coller J. RNA in unexpected places: long non-coding RNA functions in diverse cellular contexts. Nat Rev Mol Cell Biol. 2013;14(11):699–712.PubMedView ArticleGoogle Scholar
- Quinodoz S, Guttman M. Long noncoding RNAs: an emerging link between gene regulation and nuclear organization. Trends Cell Biol. 2014;24(11):651–63.PubMedPubMed CentralView ArticleGoogle Scholar
- Bergmann JH, Spector DL. Long non-coding RNAs: modulators of nuclear structure and function. Curr Opin Cell Biol. 2014;26:10–8.PubMedView ArticleGoogle Scholar
- Batista PJ, Chang HY. Long noncoding RNAs: cellular address codes in development and disease. Cell. 2013;152(6):1298–307.PubMedPubMed CentralView ArticleGoogle Scholar
- Ng SY, Lin L, Soh BS, Stanton LW. Long noncoding RNAs in development and disease of the central nervous system. Trends Genet. 2013;29(8):461–8.PubMedView ArticleGoogle Scholar
- Lee JT, Bartolomei MS. X-inactivation, imprinting, and long noncoding RNAs in health and disease. Cell. 2013;152(6):1308–23.PubMedView ArticleGoogle Scholar
- Wahlestedt C. Targeting long non-coding RNA to therapeutically upregulate gene expression. Nat Rev Drug Discov. 2013;12(6):433–46.PubMedView ArticleGoogle Scholar
- Cheetham SW, Gruhl F, Mattick JS, Dinger ME. Long noncoding RNAs and the genetics of cancer. Br J Cancer. 2013;108(12):2419–25.PubMedPubMed CentralView ArticleGoogle Scholar
- Thum T. Noncoding RNAs and myocardial fibrosis. Nat Rev Cardiol. 2014;11(11):655–63.PubMedView ArticleGoogle Scholar
- Quagliata L, Matter MS, Piscuoglio S, Arabi L, Ruiz C, Procino A, et al. Long noncoding RNA HOTTIP/HOXA13 expression is associated with disease progression and predicts outcome in hepatocellular carcinoma patients. Hepatology. 2014;59(3):911–23.PubMedPubMed CentralView ArticleGoogle Scholar
- Malik R, Patel L, Prensner JR, Shi Y, Iyer MK, Subramaniyan S, et al. The lncRNA PCAT29 inhibits oncogenic phenotypes in prostate cancer. Mol Cancer Res. 2014;12(8):1081–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25(18):1915–27.PubMedPubMed CentralView ArticleGoogle Scholar
- Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009;458(7235):223–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Pervouchine DD, Djebali S, Breschi A, Davis CA, Barja PP, Dobin A, et al. Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression. Nat Commun. 2015;6:5903.PubMedPubMed CentralView ArticleGoogle Scholar
- Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP. Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution. Cell. 2011;147(7):1537–50.PubMedPubMed CentralView ArticleGoogle Scholar
- Pauli A, Valen E, Lin MF, Garber M, Vastenhouw NL, Levin JZ, et al. Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis. Genome Res. 2012;22(3):577–91.PubMedPubMed CentralView ArticleGoogle Scholar
- Li T, Wang S, Wu R, Zhou X, Zhu D, Zhang Y. Identification of long non-protein coding RNAs in chicken skeletal muscle using next generation sequencing. Genomics. 2012;99(5):292–8.PubMedView ArticleGoogle Scholar
- Tan MH, Au KF, Yablonovitch AL, Wills AE, Chuang J, Baker JC, et al. RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development. Genome Res. 2013;23(1):201–16.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu J, Wang H, Chua NH. Long noncoding RNA transcriptome of plants. Plant Biotechnol J. 2015;13(3):319–28.PubMedView ArticleGoogle Scholar
- Nam JW, Bartel DP. Long noncoding RNAs in C. elegans. Genome Res. 2012;22(12):2529–40.PubMedPubMed CentralView ArticleGoogle Scholar
- van Werven FJ, Neuert G, Hendrick N, Lardenois A, Buratowski S, van Oudenaarden A, et al. Transcription of two long noncoding RNAs mediates mating-type control of gametogenesis in budding yeast. Cell. 2012;150(6):1170–81.PubMedPubMed CentralView ArticleGoogle Scholar
- Kim T, Xu Z, Clauder-Munster S, Steinmetz LM, Buratowski S. Set3 HDAC mediates effects of overlapping noncoding transcription on gene induction kinetics. Cell. 2012;150(6):1158–69.PubMedPubMed CentralView ArticleGoogle Scholar
- Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012;22(9):1775–89.PubMedPubMed CentralView ArticleGoogle Scholar
- Hangauer MJ, Vaughn IW, McManus MT. Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs. PLoS Genet. 2013;9(6):e1003569.PubMedPubMed CentralView ArticleGoogle Scholar
- Ranzani V, Rossetti G, Panzeri I, Arrigoni A, Bonnal RJ, Curti S, et al. The long intergenic noncoding RNA landscape of human lymphocytes highlights the regulation of T cell differentiation by linc-MAF-4. Nat Immunol. 2015;16(3):318–25.PubMedPubMed CentralView ArticleGoogle Scholar
- Amin V, Harris RA, Onuchic V, Jackson AR, Charnecki T, Paithankar S, et al. Epigenomic footprints across 111 reference epigenomes reveal tissue-specific epigenetic regulation of lincRNAs. Nat Commun. 2015;6:6370.PubMedPubMed CentralView ArticleGoogle Scholar
- Iyer MK, Niknafs YS, Malik R, Singhal U, Sahu A, Hosono Y, et al. The landscape of long noncoding RNAs in the human transcriptome. Nat Genet. 2015;47(3):199–208.PubMedPubMed CentralView ArticleGoogle Scholar
- Mercer TR, Mattick JS. Structure and function of long noncoding RNAs in epigenetic regulation. Nat Struct Mol Biol. 2013;20(3):300–7.PubMedView ArticleGoogle Scholar
- Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, et al. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 2015;43(Database issue):D168–73.PubMedPubMed CentralView ArticleGoogle Scholar
- Kowalczyk MS, Higgs DR, Gingeras TR. Molecular biology: RNA discrimination. Nature. 2012;482(7385):310–1.PubMedView ArticleGoogle Scholar
- Raabe CA, Brosius J. Does every transcript originate from a gene? Ann N Y Acad Sci. 2015;1341:136–48.PubMedView ArticleGoogle Scholar
- Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, and mechanisms. Cell. 2013;154(1):26–46.PubMedPubMed CentralView ArticleGoogle Scholar
- Chu C, Spitale RC, Chang HY. Technologies to probe functions and mechanisms of long noncoding RNAs. Nat Struct Mol Biol. 2015;22(1):29–35.PubMedView ArticleGoogle Scholar
- Bassett AR, Akhtar A, Barlow DP, Bird AP, Brockdorff N, Duboule D, et al. Considerations when investigating lncRNA function in vivo. Elife. 2014;3:e03058.PubMedPubMed CentralView ArticleGoogle Scholar
- Mudge JM, Frankish A, Harrow J. Functional transcriptomics in the post-ENCODE era. Genome Res. 2013;23(12):1961–73.PubMedPubMed CentralView ArticleGoogle Scholar
- Mattick JS, Rinn JL. Discovery and annotation of long noncoding RNAs. Nat Struct Mol Biol. 2015;22(1):5–7.PubMedView ArticleGoogle Scholar
- Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Rep. 2015;11(7):1110–22.PubMedView ArticleGoogle Scholar
- Kornienko AE, Guenzl PM, Barlow DP, Pauler FM. Gene regulation by the act of long non-coding RNA transcription. BMC Biol. 2013;11:59.PubMedPubMed CentralView ArticleGoogle Scholar
- Rinn JL, Chang HY. Genome regulation by long noncoding RNAs. Annu Rev Biochem. 2012;81:145–66.PubMedView ArticleGoogle Scholar
- Guttman M, Russell P, Ingolia NT, Weissman JS, Lander ES. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell. 2013;154(1):240–51.PubMedPubMed CentralView ArticleGoogle Scholar
- Orom UA, Derrien T, Beringer M, Gumireddy K, Gardini A, Bussotti G, et al. Long noncoding RNAs with enhancer-like function in human cells. Cell. 2010;143(1):46–58.PubMedPubMed CentralView ArticleGoogle Scholar
- Kelley D, Rinn J. Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome Biol. 2012;13(11):R107.PubMedPubMed CentralView ArticleGoogle Scholar
- Tilgner H, Knowles DG, Johnson R, Davis CA, Chakrabortty S, Djebali S, et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22(9):1616–25.PubMedPubMed CentralView ArticleGoogle Scholar
- Clark MB, Johnston RL, Inostroza-Ponta M, Fox AH, Fortini E, Moscato P, et al. Genome-wide analysis of long noncoding RNA stability. Genome Res. 2012;22(5):885–98.PubMedPubMed CentralView ArticleGoogle Scholar
- Kutter C, Watt S, Stefflova K, Wilson MD, Goncalves A, Ponting CP, et al. Rapid turnover of long noncoding RNAs and the evolution of gene expression. PLoS Genet. 2012;8(7):e1002841.PubMedPubMed CentralView ArticleGoogle Scholar
- Johnsson P, Lipovich L, Grander D, Morris KV. Evolutionary conservation of long non-coding RNAs; sequence, structure, function. Biochim Biophys Acta. 2014;1840(3):1063–71.PubMedPubMed CentralView ArticleGoogle Scholar
- Necsulea A, Kaessmann H. Evolutionary dynamics of coding and non-coding transcriptomes. Nat Rev Genet. 2014;15(11):734–48.PubMedView ArticleGoogle Scholar
- Lappalainen T, Sammeth M, Friedlander MR, t Hoen PA, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11.PubMedPubMed CentralView ArticleGoogle Scholar
- Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG. Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet. 2007;39(2):226–31.PubMedPubMed CentralView ArticleGoogle Scholar
- Storey JD, Madeoy J, Strout JL, Wurfel M, Ronald J, Akey JM. Gene-expression variation within and among human populations. Am J Hum Genet. 2007;80(3):502–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Whitney AR, Diehn M, Popper SJ, Alizadeh AA, Boldrick JC, Relman DA, et al. Individuality and variation in gene expression patterns in human blood. Proc Natl Acad Sci U S A. 2003;100(4):1896–901.PubMedPubMed CentralView ArticleGoogle Scholar
- Dumeaux V, Olsen KS, Nuel G, Paulssen RH, Borresen-Dale AL, Lund E. Deciphering normal blood gene expression variation--The NOWAC postgenome study. PLoS Genet. 2010;6(3):e1000873.PubMedPubMed CentralView ArticleGoogle Scholar
- Mele M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348(6235):660–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78.PubMedPubMed CentralView ArticleGoogle Scholar
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22(9):1760–74.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014;42(Database issue):D756–763.PubMedPubMed CentralView ArticleGoogle Scholar
- Washietl S, Findeiss S, Muller SA, Kalkhof S, von Bergen M, Hofacker IL, et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA. 2011;17(4):578–94.PubMedPubMed CentralView ArticleGoogle Scholar
- Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35(Web Server issue):W345–349.PubMedPubMed CentralView ArticleGoogle Scholar
- Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.PubMedPubMed CentralView ArticleGoogle Scholar
- Chambers JM, Freeny A, Heiberger RM. Analysis of variance; designed experiments. Pacific Grove, CA: Wadsworth & Brooks/Cole; 1992.Google Scholar
- Consortium TG. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.View ArticleGoogle Scholar
- Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, Okunishi R, et al. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 2006;16(1):11–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Wilusz JE. Long noncoding RNAs: Re-writing dogmas of RNA processing and stability. Biochim Biophys Acta. 1859;2016:128–38.Google Scholar
- Seidl CI, Stricker SH, Barlow DP. The imprinted Air ncRNA is an atypical RNAPII transcript that evades splicing and escapes nuclear export. Embo J. 2006;25(15):3565–75.PubMedPubMed CentralView ArticleGoogle Scholar
- Meng L, Person RE, Beaudet AL. Ube3a-ATS is an atypical RNA polymerase II transcript that represses the paternal expression of Ube3a. Hum Mol Genet. 2012;21(13):3001–12.PubMedPubMed CentralView ArticleGoogle Scholar
- Gonzalez-Porta M, Calvo M, Sammeth M, Guigo R. Estimation of alternative splicing variability in human populations. Genome Res. 2012;22(3):528–38.PubMedPubMed CentralView ArticleGoogle Scholar
- Glass D, Vinuela A, Davies MN, Ramasamy A, Parts L, Knowles D, et al. Gene expression changes with age in skin, adipose tissue, blood and brain. Genome Biol. 2013;14(7):R75.PubMedPubMed CentralView ArticleGoogle Scholar
- Santoro F, Mayer D, Klement RM, Warczok KE, Stukalov A, Barlow DP, et al. Imprinted Igf2r silencing depends on continuous Airn lncRNA expression and is not restricted to a developmental window. Development. 2013;140(6):1184–95.PubMedView ArticleGoogle Scholar
- Goff LA, Rinn JL. Linking RNA biology to lncRNAs. Genome Res. 2015;25(10):1456–65.PubMedPubMed CentralView ArticleGoogle Scholar
- Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, et al. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature. 2014;505(7485):635–40.PubMedView ArticleGoogle Scholar
- Kwan T, Benovoy D, Dias C, Gurd S, Provencher C, Beaulieu P, et al. Genome-wide analysis of transcript isoform variation in humans. Nat Genet. 2008;40(2):225–31.PubMedView ArticleGoogle Scholar
- Martin AR, Costa HA, Lappalainen T, Henn BM, Kidd JM, Yee MC, et al. Transcriptome sequencing from diverse human populations reveals differentiated regulatory architecture. PLoS Genet. 2014;10(8):e1004549.PubMedPubMed CentralView ArticleGoogle Scholar
- Sigova AA, Mullen AC, Molinie B, Gupta S, Orlando DA, Guenther MG, et al. Divergent transcription of long noncoding RNA/mRNA gene pairs in embryonic stem cells. Proc Natl Acad Sci U S A. 2013;110(8):2876–81.PubMedPubMed CentralView ArticleGoogle Scholar
- Hacisuleyman E, Goff LA, Trapnell C, Williams A, Henao-Mejia J, Sun L, et al. Topological organization of multichromosomal regions by the long intergenic noncoding RNA Firre. Nat Struct Mol Biol. 2014;21(2):198–206.PubMedPubMed CentralView ArticleGoogle Scholar
- Ponjavic J, Ponting CP, Lunter G. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res. 2007;17(5):556–65.PubMedPubMed CentralView ArticleGoogle Scholar
- Kapranov P, St Laurent G, Raz T, Ozsolak F, Reynolds CP, Sorensen PH, et al. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is ‘dark matter’ un-annotated RNA. BMC Biol. 2010;8:149.PubMedPubMed CentralView ArticleGoogle Scholar
- St Laurent G, Shtokalo D, Tackett MR, Yang Z, Eremina T, Wahlestedt C, et al. Intronic RNAs constitute the major fraction of the non-coding RNA in mammalian cells. BMC Genomics. 2012;13:504.PubMedPubMed CentralView ArticleGoogle Scholar
- Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Rivea Morales D, et al. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci U S A. 2009;106(28):11667–72.PubMedPubMed CentralView ArticleGoogle Scholar
- Wright MW. A short guide to long non-coding RNA gene nomenclature. Hum Genomics. 2014;8:7.PubMedPubMed CentralView ArticleGoogle Scholar
- St Laurent G, Wahlestedt C, Kapranov P. The Landscape of long noncoding RNA classification. Trends Genet. 2015;31(5):239–51.PubMedView ArticleGoogle Scholar
- Sultan M, Dokel S, Amstislavskiy V, Wuttig D, Sultmann H, Lehrach H, et al. A simple strand-specific RNA-Seq library preparation protocol combining the Illumina TruSeq RNA and the dUTP methods. Biochem Biophys Res Commun. 2012;422(4):643–6.PubMedView ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–10.PubMedPubMed CentralView ArticleGoogle Scholar