Long non-coding RNAs display higher natural expression variation than protein-coding genes in healthy humans

Background Long non-coding RNAs (lncRNAs) are increasingly implicated as gene regulators and may ultimately be more numerous than protein-coding genes in the human genome. Despite large numbers of reported lncRNAs, reference annotations are likely incomplete due to their lower and tighter tissue-specific expression compared to mRNAs. An unexplored factor potentially confounding lncRNA identification is inter-individual expression variability. Here, we characterize lncRNA natural expression variability in human primary granulocytes. Results We annotate granulocyte lncRNAs and mRNAs in RNA-seq data from 10 healthy individuals, identifying multiple lncRNAs absent from reference annotations, and use this to investigate three known features (higher tissue-specificity, lower expression, and reduced splicing efficiency) of lncRNAs relative to mRNAs. Expression variability was examined in seven individuals sampled three times at 1- or more than 1-month intervals. We show that lncRNAs display significantly more inter-individual expression variability compared to mRNAs. We confirm this finding in two independent human datasets by analyzing multiple tissues from the GTEx project and lymphoblastoid cell lines from the GEUVADIS project. Using the latter dataset we also show that including more human donors into the transcriptome annotation pipeline allows identification of an increasing number of lncRNAs, but minimally affects mRNA gene number. Conclusions A comprehensive annotation of lncRNAs is known to require an approach that is sensitive to low and tight tissue-specific expression. Here we show that increased inter-individual expression variability is an additional general lncRNA feature to consider when creating a comprehensive annotation of human lncRNAs or proposing their use as prognostic or disease markers. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0873-8) contains supplementary material, which is available to authorized users.

. Quality of de novo transcriptome annotation in granulocytes A. Assembling known lncRNAs -UCSC browser screen shot of an example of complete coverage of a well-known XIST lncRNA by de novo lncRNA annotation in granulocytes. From top to bottom: RefSeq gene annotation (the antisense TSIX lncRNA is cropped at the 5' end), de novo lncRNA loci annotation, de novo lncRNA transcript annotation (the pipeline annotates various isoforms of XIST), de novo mRNA loci and transcript annotation (empty in this genomic region), normalized PolyA+ RNA-seq signal for granulocytes from donor n2 (time point 2) and ribosomal depleted (Ribo Zero) RNA-seq of the same granulocyte sample (Additional File 2B). Both RNA-seq tracks show the presence of XIST and an absence of TSIX in granulocytes of Donor 2 (Additional File 2A). B and C. Completeness of de novo assembly. Exonic coverage (Methods) of RefSeq (B) and GENCODE v19 (C) annotated mRNAs by de novo mRNA annotation in granulocytes. Genes are split into 5 bins, according to their average expression level in PolyA+ granulocyte RNA-seq samples used for transcriptome assembly (Additional File 2B), in order to account for the bias between expression level and assembly success. Median levels from left to right: B: 97. 2, 98.1, 98.8, 98.7, 98.4; C: 95.3, 96.6, 97.9, 97.9, 98.2. Outliers are not displayed in the boxplots. A. The majority of de novo lncRNAs annotated in granulocytes are supported by an overlap with an EST. The bar plot shows the percentage of transcripts that have a sense exonic overlap with a spliced EST (human EST database downloaded from UCSC) for (from left to right): de novo lncRNAs annotated in this study in granulocytes, lncRNAs annotated by RefSeq and GENCODE v19, de novo mRNAs annotated in this study in granulocytes, mRNAs annotated by RefSeq and GENCODE v19. Numbers inside bars indicate total number of transcripts in each annotation. B. The majority of de novo lncRNAs in granulocytes are supported by an overlap with MiTranscriptome lncRNAs. The bar plot shows percentage of granulocyte de novo lncRNA (green) and mRNA (blue) transcripts supported by an exonic overlap with a multi-exonic lncRNA or mRNA respectively from MiTranscriptome, GENCODE v19, RefSeq and Cabili et al (only lncRNAs) annotations [2][3][4][5]  Cloning result for each lncRNA locus de novo annotated in granulocytes is represented by a UCSC browser screen shot. Each screen shot contains (from top to bottom): chromosome scale, chromosome coordinates, RefSeq gene annotation, granulocyte de novo mRNA loci annotation obtained in the study (blue), granulocyte de novo lncRNA loci annotation obtained in the study (green), de novo lncRNA transcripts constituting the given lncRNA locus (green), RT-PCR primers used to amplify the targeted full length lncRNA transcript (Additional File 2F), BLAT alignments of Sanger sequences obtained after cloning the targeted lncRNA transcript (black). For the loci containing several isoforms, those used for primer design are marked with (*). RT-PCR and cloning procedure do not preserve strandness of the transcript and, thus, the BLAT alignments' strands do not necessarily match the corresponding lncRNA strand. Sanger sequencing results ("_T7" tag for T7 primer used) from some cloned products could not cover the full-length transcript and these products were sequenced from the other end ("_SP6" tag for SP6 primer used). Red lines in BLAT alignments indicate mismatches between the UCSC reference genome sequence and the sequence of the aligned Sanger sequences, the width of a red line does not scale with the size of the mismatch.   1 GRA_denovo_LNC.294.2 GRA_denovo_LNC.294.3 GRA_denovo_LNC.294.4 GRA_denovo_LNC.294.5 GRA_denovo_LNC.294.6 GRA_denovo_LNC.294.7 GRA_denovo_LNC.294.8 GRA_denovo_LNC.294.9 GRA_denovo_LNC.294.10 GRA_denovo_LNC.294.18 GRA_denovo_LNC.294.16 GRA_denovo_LNC.294.17 GRA_denovo_LNC.294.15 GRA_denovo_LNC.294.13 GRA_denovo_LNC.294.14 GRA_denovo_LNC.294.12 GRA_denovo_LNC.294.11 GRA_denovo_LNC.294.19 GRA_denovo_LNC.294.20 GRA_denovo_LNC.294.21 GRA_denovo_LNC.294.22 GRA_denovo_LNC.294.23 GRA_denovo_LNC.294.24 GRA_denovo_LNC.294.25 GRA_denovo_LNC.294.26 GRA_denovo_LNC.294.27 GRA_denovo_LNC.294.28 GRA_denovo_LNC.294.29 GRA_denovo_LNC.294.30 GRA_denovo_LNC.294          Figure S11. Features that distinguish granulocyte lncRNAs from mRNAs also distinguish novel granulocyte lncRNAs from publicly annotated lncRNAs A. Abundance in total RNA fraction of granulocyte de novo lncRNA and mRNA transcripts was calculated as an average from all 21 available Ribosomal Depleted RNA-seq samples. Exon number (B), percent GC content of transcription start site (TSS) region (TSS +/-1.5 kb) (C), percent repeat coverage of TSS region (D) and exons (E) of de novo annotated lncRNAs (green) and mRNA (blue) transcripts. Abundance in total RNA fraction (F), exon number (G), percent GC content of TSS region (H), percent repeat coverage of TSS region (I) and exons (J) of the three classes of de novo lncRNAs (described in Fig. 2A). C, D and E: "shuffled" control (white) is added to the box plots. Shuffled control represents random regions in the genome using bedtools shuffle. C and D:

Figure S12
Kornienko et al., Supplemental Figure 11. Difference between mRNAs and lncRNAs is persistent independently of expression level splicing efficiency,%  Figure S12. Difference between mRNAs and lncRNAs is persistent independently of expression level A. PolyA+ enrichment (calculated as in Figure 2E) of granulocyte de novo lncRNA and mRNA transcripts split into expression bins. Median values in boxes from left to right: 1.4, 3.5, 1.5, 2.9, 1.5, 2.8, 1.2, 2.5, 1.0, 1.9. B. Splicing efficiency (calculated as in Figure 2F)

Figure S13
Kornienko et al., Supplemental Figure 11. Splicing efficiency calculation  Figure S13. Splicing efficiency calculation A. Overview of splicing efficiency calculation (Methods). Shown is an illustrative RNA-seq signal track of a genomic locus. Splicing efficiency of each splice site in the locus is calculated based on the ratio between intronic and exonic RNA-seq signal at this splice site. RNA-seq signal is calculated as RPKM of 45bp exonic and intronic regions surrounding the splice site. 45bp are positioned 5bp away from the precise splice site position to accommodate for potential imprecise splice site annotation. Splicing efficiency is calculated with the given formula. In case of intronic signal exceeding exonic signal, splicing efficiency is set to 0. Maximal splice site splicing efficiency is taken to estimate splicing efficiency of each transcript, as well as of the whole locus.

B.
RT-PCRs to test splicing efficiency calculation (see Supplemental Methods). RT-PCR result for one junction (junction length<1500bp) of eight inefficiently spliced (upper row gel pictures, see Supplemental Methods) and efficiently spliced (lower row gel pictures, see Supplemental Methods) de novo granulocyte transcripts (4xlncRNAss and 4xmRNAs). List of assayed lncRNA and mRNA transcripts and primers to amplify their short junction is given in the table below. Size of the expected PCR product amplifying the unspliced junction is given in the right-most column of the table. Each primer pair was used with granulocyte cDNA sample and three control samples: genomic DNA (from the same granulocyte sample) -to test efficiency of the primers when amplifying the long unspliced isoform, water -to test PCR reaction contamination, -RT (no reverse transcriptase added when preparing cDNA) control -to test for genomic DNA contamination in the cDNA sample. Red stars on the left from the band indicate the expected band corresponding to the unspliced product. Blue stars on the left from the band indicate spliced products. lncRNA gra91 and lncRNA gra1342 -absence of a band in gDNA indicates inefficiency of primers. mRNA gra1415 -band marked "p" might indicate the presence of a pseudogene corresponding to the assayed protein-coding gene somewhere in the genome. Overall, the RT-PCR test validates our splicing efficiency calculation with transcripts identified as unspliced showing an abundant unspliced isoform (5 out of 6 with one (mRNA gra1793) exception) and transcripts identified as spliced showing no unspliced isoform (6 out of 8 with two exceptions (lncRNA gra1168 and mRNA mRNA gra3788) showing some unspliced product signal, however it is much fainter than the band formed by the spliced products). Note that, as described in (A), we assign a transcript with the splicing efficiency of its best spliced splice site, whereas the RT-PCR assay allows to assess splicing efficiency of only the short (<1.5kb) splice sites, which might explain the slight discrepancy in the results presented in (B).

C.
Analysis of splicing efficiency of the whole locus confirms reduced splicing efficiency of granulocyte de novo lncRNAs compared to mRNAs. Boxplot shows splicing efficiency (as described in A, Methods) of de novo lncRNA (green) and mRNA (blue) loci annotated in granulocytes. Splicing efficiency of each splice site was calculated (Methods) and the efficiency of the most efficiently spliced transcript (i.e. most efficiently spliced site of all the transcripts) in each locus is plotted. Median values: lncRNAs: 63.46%, mRNAs: 99.07%.

D.
New lncRNA loci are less efficiently spliced than known lncRNA loci. Boxplot shows splicing efficiency (as described in A, Methods) of new (light grey) and known (dark grey) (as described in Fig

E.
Two illustrative examples of exon spanning RT-PCR amplifying a continuous unspliced isoform. Gel electrophoresis of RT-PCR over splice junctions in granulocyte de novo lncRNA loci 350 and 720 (See Additional File 2E for primer sequence and junction genomic position and Additional File 3 for locus annotation). Above the gel picture, splicing efficiency calculated from RNA-seq data as described in (A) is shown. j: junction number, w: PCR water control. *: bands corresponding to spliced products, u: bands corresponding to the unspliced product. Primer span: the genomic span of PCR primers corresponding to the length of the unspliced PCR product.
Remarks to boxplots: Numbers inside boxes indicate number of loci displayed in each box. The box plots display the full population but p-values are calculated using Mann-Whitney U test with equalized sample size (Methods). *-p<0.01, *** -p<10 -16 . Outliers are not displayed. B. PolyA+ enrichment of multi-exonic MiTranscriptome lncRNA (green) and mRNA (blue) transcripts as described in Fig. 2E. Only transcripts detected in total granulocyte RNA-seq data (average RPKM among 21 samples >0.2) are analyzed.
C. Splicing efficiency of multi-exonic MiTranscriptome lncRNA (green) and mRNA (blue) transcripts expressed in granulocytes. Splicing efficiency was calculated using ribosomal-depleted RNA-seq from 7 donors (time points pooled to increase the coverage) (Methods). The splicing efficiency of the most efficiently spliced site in each transcript is plotted.

D.
Distribution of multi-exonic MiTranscriptome lncRNA transcripts according to their coverage in the 3 commonly used public annotations as described in Fig Remarks to boxplots A, B, C, E, F and G: Numbers on the right indicate the numbers of transcripts analyzed in each boxplot. The box plots display the full population but p-values are calculated using Mann-Whitney U test with equalized sample size (Methods). *** -p<10 -10 , ** -p<10 -5 , * -p<0.  Only MiTranscriptome transcripts expressed in granulocyte total RNA-seq dataset (average RPKM>0.2) are plotted. Exon number (B), percent GC content of TSS region (TSS +/-1.5 kb) (C), percent repeat coverage of TSS region (D) and exons (E) of multi-exonic MiTranscriptome lncRNAs and mRNA transcripts. Abundance in total RNA fraction in granulocytes (F), exon number (G), percent GC content of TSS region (H), percent repeat coverage of TSS region (I) and exons (J) of the three classes of MiTranscriptome lncRNA as described in Figure S14D. Remarks: green -multiexonic MiTranscriptome lncRNAs, blue -multi-exonic MiTranscriptome mRNAs, light gray -"not in PA" lncRNA transcripts, medium gray -"isoform not in PA" lncRNA transcripts, dark gray -"PA" lncRNA transcripts. Numbers of transcripts in each box of the boxplots are indicated on the right. *** -p<10 -10 , ** -p<10 -5 , * -p<0.01, n.s.

Figure S16
Kornienko et al., Supplemental Figure 16. Intra-individual variability is significantly lower than inter-individual variability for both lncRNAs and mRNAs in granulocytes   The boxplot shows values of different types of expression variability of granulocyte de novo annotated lncRNA (green) and mRNA (blue) transcripts in granulocyte total RNA-seq dataset. We calculated intra-individual variability (between 3 replicates of each donor), compared it to interindividual variability (between 7 donors, data displayed in Fig. 4A, here indicated as transparent boxes). We then controlled for the reduced sample size in intra-individual variability calculation (3 samples for intra-individual vs. 7 samples for inter-individual variability) by randomly sampling 3 replicates and asking if the inter-individual variability observed is reduced with the reduced sample size. Graphical representation of calculation of the three variability types plotted is displayed below the boxplot. From top to bottom: Intra-individual variability (variability "between 3 replicates"): standard deviation was calculated for each donor by calculating standard deviation of 3 replicates and normalized by the mean of the 3 replicates, 7 normalized standard deviation values from 7 donors were then averaged to give intra-individual variability for each transcript. Control (variability between 3 random replicates): 3 replicates were randomly picked from the 21 (7 donors x 3 replicates) samples, normalized standard deviation was calculated for the 3 samples, the random sampling was performed 5 times and the average of 5 normalized standard deviations was calculated. Interindividual variability (variability between 7 donors): calculated as described for Fig. 4A

Figure S17
Kornienko et al., Supplemental Figure 16. lncRNAs are more variable than mRNAs independent of expression

Figure S23
normalized standard deviation Expression variability (7 donors Expression variability (7 donors           The phenomenon of increased expression variability of lncRNAs compared to mRNAs is not biased to the absolute expression level of two types of transcripts in the LCL dataset. Normalized standard deviation of de novo LCL lncRNA (green) and mRNA (blue) transcripts (top) or loci (bottom) expression in LCL from 462 donors split into 5 expression bins according to their maximal expression level (RPKM) among 462 donors. Note, that in bin 1 of the upper box plot the difference between lncRNAs and mRNAs is not significant, in contrast to similar analysis performed in granulocytes ( Figure S17A). This is likely caused by the increased number of donors (462 -LCL vs. 7granulocytes) and the way we split transcripts/loci into bins by their maximal expression among all donors. Thus, bin 1 most likely represents not expressed or marginally expressed transcripts, with an outlier reaching the maximal RPKM of 0.5 to 1, whose variability is strongly affected by detection bias. Such technical bias affects any transcript equally and thus lncRNA and mRNA expression variability is indistinguishable in bin 1. Note that variability of expression calculated over the whole locus (bottom plot) shows consistent lncRNA/mRNA difference in all bins. Remarks: Chr. X and Y were discarded from the analysis. *** -p<10 -10 , ** -p<10 -10 , * -p<0.01, n.s.  Antisense lncRNAs expression analysis in the LCL dataset (particularly that of expression over the whole locus (B)) could be biased to the expression of the overlapped mRNAs since the RNAseq data was not strand-specific and thus expression variability of antisense lncRNAs is reduced accordant to the reduced mRNA expression variability (Fig. 5D).   Figure S30. Confirmation of increased lncRNA expression variability in multiple human tissues using GTEx project data: expression level control Binned normalized standard deviation of MiTranscriptome lncRNA (green) and mRNA (blue) transcripts expression between 20 donors in 9 tissues (as described for Fig. 6, Results, Methods). Transcripts were split into 5 expression bins according to their maximal expression level (RPKM) among 20 donors for each tissue. Chromosomes X and Y were discarded from the analysis. Remarks to the boxplots: *** -p<10 -16 , ** -p<10 -10 , * -p<0.01, n.s. -p>0.01. The box plots display the full population but p-values are calculated using Mann-Whitney U test with equalized sample size (Methods  A. Algorithm for investigating the relation between the number of identified lncRNA loci and the number of donors analyzed. 120 donors (out of total 462 donors available from Geuvadis dataset [2]) were picked to be used in the study. Only unrelated samples with > 25 million reads were used. Each of the five population groups was represented by 12 females and 12 males (Additional File 11A). All the 120 RNA-seq data sets were randomly down-sampled to give 25 million paired-end reads each. 120 donors were grouped into 30 pools (each pool contained 2 females and 2 males from the same population) of 100 (25x4) million reads each. Each pool was used to assemble LCL transcriptome using Cufflinks resulting in 30 transcriptome assemblies. We then used 1,2,3,4,5,6,8,10,15,20,25, 30 transcriptome assemblies to de novo annotate lncRNAs and mRNAs in LCL using the annotation pipeline established in the study and plot the number of annotated loci vs. number of donors as an output of the analysis (see Figure 7C and Additional File 11C).

B.
For each number of assemblies (i.e. each data point) we performed random picking from the list of 30 assemblies (Additional File 11B).
C. Number of transcript isoforms increases similarly for lncRNA and mRNA loci with the increase of number of donors used for transcript annotation. The plot shows number of LCL de novo lncRNA (green) and mRNA (blue) transcripts annotated using different number of assemblies obtained from different number of donors. The Y-axis corresponding to lncRNAs (green) is placed on the left, the Yaxis corresponding to mRNAs (blue) is placed on the right. The range of values from 0 to 25,000 transcripts for lncRNAs and 14x fold more -from 0 to 350,000 for mRNAs. In spite of differing absolute numbers the dynamics of the increase is the same for lncRNAs and mRNAs. Maximum number of lncRNA transcripts -20,992, maximum number of mRNA transcripts -330,811 (data table Additional File 11C). Error bars that represent standard deviation in transcript number between three replicates of random assembly picking (B) are present for all data points but mostly not visible due to their low values.

D.
Increasing donor numbers allows identification of an increasing number of isoforms per mRNA locus, whereas lncRNAs keep low median number of transcripts per locus while increasing the number of loci annotated in the genome. The plot shows the median transcript number in LCL de novo lncRNA (green) and mRNA (blue) loci annotated using different number of assemblies obtained from different number of donors. Error bars represent the standard deviation of loci number between 3 replicates of random picking for each number of assemblies used for identification (data table Additional File 11C). Error bars that represent standard deviation between three replicates are present for all data points but mostly not visible due to their low values.

E.
Estimating how many unknown mRNA loci could be assembled and then discarded at the proteincoding capacity filtering step with the increasing number of donors.

Expression of the lncRNAs found with certain donor number but not found when analyzing less donors (RPKM of these lncRNAs in the respective donors used for identification -averaged between the donors)
found in 8d but not in 4d found in 12d but not in 8d found in 16d but not in 12d found in 20d but not in 16d found in 24d but not in 20d found in 32d but not in 24d found in 40d but not in 32d found in 60d but not in 40d found in 80d but not in 60d found in 100d but not in 80d found in 120d but not in 100d found in 8d but not in 4d found in 12d but not in 8d found in 16d but not in 12d found in 20d but not in 16d found in 24d but not in 20d found in 32d but not in 24d found in 40d but not in 32d found in 60d but not in 40d found in 80d but not in 60d found in 100d but not in 80d found in 120d but not in 100d lncRNAs

Expression of the lncRNAs found with certain donor number but not found when analyzing less donors (RPKM of these lncRNAs in the respective donors used for identification )
Supplemental Overview: Expression level of the identified lncRNAs (Fig. 7C) is not decreasing with the increase of donor number and potential sensitivity of the pipeline. We asked if the lncRNA loci that we identified by adding more donors to the analysis were more lowly expressed than those identified using less donors. If so, this would indicate that the amount of sequencing data rather than number of individuals allows identification of new minimally expressed lncRNA loci. We plotted the expression level of lncRNA transcripts initiating from loci annotated using more donors, that could not be identified (defined as 50% sense overlap) using less donors. We found that expression of lncRNAs that require more and more donors to be identified does not anti-correlate with the donor number. Thus the identification of more lncRNAs in larger donor collections does not specifically identify lowly expressed transcripts. The boxplot shows the maximal (top) and mean (bottom) (among the donors used for the annotation) expression level of de novo LCL lncRNA transcripts annotated using the indicated number of donors (indicated on the X axis). Only transcripts expressed from loci that have not been identified ("identified"= >50% sense overlap) using less donors (indicated on the X axis) are displayed. 9 boxes for each number of donors show the result for all the nine possible pairwise comparisons between three replicates of each donor-number annotations (e.g. "found in 8d but not in 4d": box1. lncRNA loci in 8-donor annotation replicate 1 not identified in 4-donor annotation replicate 1, box2. lncRNA loci in 8-donor annotation replicate 1 not identified in 4-donor annotation replicate 2, box3. lncRNA loci in 8-donor annotation replicate 1 not identified in 4-donor annotation replicate 3, box4. lncRNA loci in 8-donor annotation replicate 2 not identified in 4-donor annotation replicate 1, etc). 120-donor annotation only has 1 replicate, thus giving just three boxes. Outliers are not displayed in the boxplot. Horizontal dashed red line and red number indicate median level of the first box in the boxplot.  Figure S33. Increasing donor number identifies increased numbers lncRNAs in all expression bins.

Dynamics of identification of lncRNA transcripts of different expression level
Dynamics of identification upon donor number increase of transcripts split into 6 bins according to their maximal expression among donors used for their identification (Additional File 11E). Every plot shows number of transcripts (orange) and loci (black) these transcripts initiate from. Number of transcripts/loci is normalized to the number of transcripts/loci in 120-donor annotation. Absolute number of transcript/loci is given for 4-donor and 120-donor annotations (boxes above the plots). Error bars show standard deviation between 3 replicates for each donor number. Remarks: bin 0 has not been used in other figures and represent marginally expressed transcripts (RPKM<0.5). The dynamics of de novo identification of new (light gray) and known (dark gray) lncRNA loci in LCL using an increasing donor number (new -not covered by reference public annotations, knowncovered by reference public annotations. As described for Fig. 2A). Dynamics for all lncRNA loci (dashed green line) and mRNA loci (dashed blue line) is indicated for comparison. The loci number is normalized to the total number of loci in the most comprehensive 120-donor annotation and set to 100% for each curve. Maximum number (100%) for new lncRNA loci: 2,063, maximum number (100%) for known lncRNA loci: 2,103. Error bars indicate standard deviation between 3 replicates of random picking for each number of assemblies used (Additional File 11C).

Figure S35
Exonic coverage of mRNAs and lncRNAs annotated from 120 donors by annotations obtained using less donors

Number of donors used for transcriptome annotation
Number of "120-donor" mRNA and lncRNA loci identified using less donors Donor saturation analysis. Top: definition of an "identified locus": lncRNA and mRNA annotation obtained using 120 donors (the most comprehensive annotation) is used as a reference for comparison and the number of loci in this annotation is set to 100%. When analyzing annotations obtained from fewer donors, a reference locus is called "identified" if it is covered by the down-sampled annotation to at least 50% of its length (black tick). In case the coverage is less the locus is not considered "identified" (red cross). Below: normalized number of 120-donor reference annotation loci "identified" when running the identification pipeline with fewer donors. Displayed are donor saturation curves for mRNA loci (blue), all lncRNA loci (green), only new lncRNA loci (unfilled light grey circles and dashed line) and only known lncRNA loci (unfilled dark grey circles and dashed line). Number loci in the 120-donor annotation was set to 100% for all the four displayed loci type. Error bars: representing standard deviation between 3 replicates are present for all data points but due to low values are mostly not visible (Additional File 11D). B. Donor saturation curve of exon structure identification. The plot shows the percent exonic coverage (Supplemental Methods) matching the reference annotation generated from 120 donors (set to 100%), obtained using fewer donors. RNA isolation using TRI reagent 3.
Preparation of strand-specific RNA-seq libraries 7.
Public gene annotations used in the study 8.
Calculation of GC content 10.
RT and qRT-PCR primer design 11.
Annotating mRNAs and lncRNAs in primary granulocytes de novo 11.1 Filtering steps 11.1.1 Filtering for mRNAs 11.1.2 Filtering for lncRNAs 11.2 Combining de novo lncRNA and mRNA transcripts into genomic loci 11.3 Protein-coding potential calculation pipeline 12.
Creating granulocyte specificity estimation heat maps 14.
RT-PCRs to control splicing efficiency calculation

Granulocyte isolation
Granulocytes and mononuclear cells (MNCs) were isolated from freshly collected blood using gradient density centrifugation (Ficoll-Plaque PREMIUM 1.078 g/ml, GE Healthcare Life Sciences). Briefly, 45 ml of fresh blood was centrifuged at 100g for 10 minutes at room temperature and the yellowish supernatant was discarded to remove the platelet rich plasma. The remainder was diluted approximately four fold with room temperature (RT) PBS (+2 mM EDTA) to 144 ml. 35 ml of diluted blood was carefully layered on top of 15 ml of Ficoll (equilibrated at RT) in a 50 ml Falcon tube and four such tubes were centrifuged at RT at 400g for 33 minutes with acceleration/brake at minimum. After centrifugation the upper layer was carefully removed and discarded, the MNC layer immediately on top of the Ficoll separation layer was collected into a new tube and washed in ice cold PBS (+2mM EDTA) by centrifugation at 300g for 10 minutes. The upper Ficoll layer was carefully removed and the underlying remaining layer containing the granulocyte population was depleted for erythrocytes using Cell Lysis Solution (Promega) in two 5 minute incubation steps followed by 5 minute 300g centrifugation at RT. Granulocytes were then washed in ice cold PBS (+2 mM EDTA) by centrifugation at 300g for 5 minutes. Both MNCs and granulocytes underwent one further ice-cold PBS (+2 mM EDTA) washing step (8 minutes at 200g) to remove residual platelets and to create a pellet for immediate RNA isolation 2. RNA isolation using TRI reagent Pelleted cells were lysed in 1 ml of TRI reagent (Sigma-Aldrich T9424) per 10 7 cells by active pipetting / vortexing and incubated for 5 minutes at room temperature (RT) and the lysate frozen and stored at -80 o C for later RNA isolation. After thawing on ice, 0.1 ml BCP (Molecular Research Center, Inc.) was added per 1 ml of TRI reagent, followed by intensive shaking and 10 minute incubation at RT, and then centrifuged for 12 minutes at 12000g at 4 o C. The upper aqueous phase was transferred to a new tube with 0.5 ml isopropanol, vortexed and incubated for 10 minutes at RT to allow RNA precipitation. The RNA precipitate was pelleted by centrifugation at 12000g for 12 minutes at 4 o C, the supernatant was removed and the pellet washed with 1 ml of 70% ethanol (7500g, 5 minutes, 4 o C). After ethanol removal, the pellet was air-dried for few minutes and dissolved in RNA Storage Solution (RSS) (Ambion) and stored at 80 o C. DNaseI treatment was performed for 10 µg RNA per 50µl reaction, using the DNA-free kit (Ambion).

Annotating mRNAs and lncRNAs in primary granulocytes
Transcriptome assembly: 17 PolyA+ RNA-seq data sets (comprising ten different healthy donors) with total of 784 M mapped reads were used to create the granulocyte mRNA and lncRNA annotations used for further analysis (Additional File 2A, B). Although it is suggested that each sample's transcriptome should be assembled separately [7], PolyA+ datasets were pooled into 6 parts at the stage of alignment in order to increase the sensitivity of the transcriptome assembly (Additional File 2C). Pools were created so that they contained 24-37 M spliced reads each (85-102 M mapped reads for 4 x 100bp paired-end pools and ~ 220 M reads for 2 x 50 bp paired-end pools). Six alignment .BAM files were created as described in Methods and then used to perform six separate transcriptome assemblies using Cufflinks [7]. In order to prevent a bias towards identification of already annotated transcripts no reference gene annotation was provided to Cufflinks. In order to avoid Cufflinks pausing over problematic regions in the genome annotated pseudogenes (GENCODE v 19) were masked using --mask-file option. Cufflinks was run for 6 .BAM files (6 pools' alignments) with the following options: cufflinks --multi-read-correct --output-dir [output_dir] -F 0.01 -p 7 --library-type fr-unstranded (for stranded pools --library-type fr-firststrand) --mask-file pseudogenes.gtf PolyA_pool_N.sorted.bam Removal of mono-exonic transcripts: All single exon transcripts were removed from each of the 6 transcriptome assemblies using gffread tool (part of Cufflinks package): gffread transcripts.gtf -T -Uo transcripts_multiexon.gtf. The three main rationales for focusing on multi-exonic transcripts were the following. First, the majority of mono-exonic transcripts assembled by Cufflinks appear to be intronic signals and other artifacts. Second, mono-exonic transcripts assembled by Cufflinks are not continuous unlike spliced transcripts whose continuity is supported by the spliced reads spanning thousands of kilobases. Third, 13 out of 17 PolyA+ datasets we used for the assembly were not strand-specific. However, the presence of a splice site in a transcript allowed Cufflinks to infer the strand it was transcribed from. Previous publications extensively used non-strand-specific data for Cufflinks based annotation of multi-exonic lncRNAs in human and have shown that the error rate of inferring the strand from the canonical splice sites was negligible [3,8]. Additionally, to be maximally stringent, we also removed potential strand specificity artifacts at later filtering steps. Merging the annotations: The resulting six multi-exonic transcriptome annotations were merged using Cuffmerge with the following command: cuffmerge -s hg19.fa --keep-tmp -p 8 --min-isoformfraction 0 list_of_6_annotation_files.txt. The resulting merged annotation contained 158,038 transcripts comprising 13,589 loci.

Filtering Steps
Additional 55 The merged transcriptome annotation was then filtered in order to create granulocyte mRNA and lncRNA annotations for further use in the study. It was necessary to de novo annotated mRNA genes as we noticed that comparison of de novo annotated lncRNAs to mRNAs annotated by RefSeq or GENCODE, was misleading due to the precision of the de novo annotation for granulocytes where only the isoforms actually expressed in granulocytes were annotated, and due to potential artifacts of de novo annotation missing from the curated, comprehensive and experimentally supported public mRNA annotations. Thus, in order to avoid potential technical biases, mRNA annotation was created de novo in granulocyte using the same pipeline that was used for lncRNAs. The following common filtering steps for both mRNAs and lncRNAs: Expression cut off: We used 6 pools of RNA-seq data to increase the sensitivity of Cufflinks to lowly expressed spliced isoforms of lncRNAs. However, the increased sensitivity could potentially result in false positive transcripts. We checked if the assembled transcripts could be detected in at least one of the diverse granulocytes RNA-seq samples from 10 individuals used in the study. We used an expression level calculation method independent from Cufflinks (RPKM_count.py from RSeQC package) to analyze the expression of all the de novo assembled transcripts in all the available granulocyte RNA-seq datasets (17 PolyA+ RNA-seq datasets + 21 total RNA-seq datasets). If a transcript was not expressed (RPKM<=0.2) in any of the datasets, we called it an artifact and removed from the annotation. By this step 0.4% (631) of transcripts was removed resulting in residual 157,407 transcripts. Filtering out transcripts potentially assigned to the wrong strand: Although Cufflinks infers the strand of the transcript from the direction of spliced junctions within this transcript, we wanted to be stringent and remove potential artifacts assigned to the wrong strand. For that we performed two steps of filtering. First, using a custom script, we checked if the de novo annotation contains transcripts that have a "mirror" transcript on the other strand (that is a transcript that has exons with > 30% reciprocal antisense overlap with exons of a transcript on another strand). Such transcripts arising from problems in the strand-specificity step could be potential artifacts and had to be removed in order to create a stringent set of transcripts. In each pair of such transcripts we then kept the one with the higher expression level. As an expression level estimate for each transcript we took the maximum RPKM among all the stranded RNA-seq datasets (four PolyA+ RNA-seq datasets and 21 total RNA-seq datasets). By this step 1.4% (2,273) of transcripts was removed from the annotation, resulting in residual 155,134 transcripts. 107 transcripts that fulfilled the criteria were not expressed in any of the stranded samples (RPKM<=0.2), and therefore could not be filtered out and were kept in the annotation and run through the next stage of filtering. Second, transcripts that had exons with >20% reciprocal antisense overlap with exons of an annotated mRNA (RefSeq or GENCODE v19) or lncRNA (GENCODE v19) expressed in any of the 6 pools were removed from the annotation. By this step 2.0% (3,142) of transcripts was removed from the annotation, resulting in residual 151,992 transcripts. Size cut off: LncRNA transcripts are by definition longer than 200nt. Therefore we removed all the transcripts whose summary exon length was <200nt. By this step 0.02% (37) of transcripts was removed from the annotation, resulting in residual 151,955 transcripts. Exon length cut off: To further remove potential artifacts from the annotation, we filtered out the transcripts with unusually long exons and with an unusually high exon/intron length ratio. To set the cutoff we checked the properties of the annotated mRNA and lncRNA genes and found that 99.4 % of GENCODE multi-exonic lncRNAs, 99.7% of GENCODE multi-exonic mRNAs and 99.7% RefSeq multi-exonic mRNAs do not have exons longer than 15kb and their exons constitute less than 90% of total gene length ( Figure S1C). We removed all the transcripts not fulfilling these criteria. By this step 3.9% (5,939) of transcripts was removed from the annotation, resulting in residual 146,016 transcripts. Repeat coverage cut off: lncRNAs are rich in repeat elements [9] and we allowed reads that mapped in several locations in the genome to be aligned (see RNA sequencing read alignment above). Therefore, we allowed some repeat elements to be mapped and potentially assembled as transcripts. It was necessary then to remove the transcripts assembled mainly from repeat regions and thus potentially being artifacts. Using the following command containing the custom script -bed12ToBed6 -i annotation.bed | coverageBed -b stdin -a RepeatMaskUCSC.bed | perl repeat_coveage.pl > coverage_of_genes -we performed a control check of annotated multi-exonic lncRNA and mRNA genes (lncRNAs by GENCODE v19, mRNAs by GENCODE v19 and RefSeq -see above Public gene annotations used in the study) and found that repeat coverage of exons of 95,5% of GENCODE multi-exonic lncRNAs, 99,87% of GENCODE multi-exonic mRNAs and 99,96% RefSeq multiexonic mRNAs did not exceed 80% ( Figure S1D). By that we set the cutoff to 80% for the filtering and, using the command given above, removed all the de novo transcripts whose exons were covered by repeats more than 80%. By this step 0.7% (1,029) of transcripts was removed from the annotation, resulting in residual 144,987 transcripts.

Filtering for mRNAs -Creating de novo mRNA annotation in primary granulocytes
Overlap with annotated protein-coding genes: As the protein-coding genome is very well annotated we defined de novo mRNAs as transcripts that overlapped exons of protein-coding genes annotated by RefSeq or GENCODE v19 in the sense orientation (we used intersectBed tool with thesplit option). This filtering step resulted in 136,482 transcripts being called protein-coding based on their overlap with the annotation. Filtering out transcripts spanning from mRNAs to annotated lncRNAs: Some de novo transcripts spanned over more than one gene, which can be caused by Cufflinks artificially joining spliced transcripts located close to each other. However it is also known that transcription from a gene can run through a downstream gene and use its splice sites to create a chimera transcript [10]. We aimed to remove such chimera transcripts. While it was possible to remove protein-coding transcripts that span more than one protein-coding gene, due to the poor annotation of lncRNAs, we could not exclude transcripts that comprise a chimera of two lncRNA genes merged together by Cufflinks. As our goal was to process de novo mRNA transcripts similarly to de novo lncRNA transcripts, we did not apply this filtering to de novo mRNAs or lncRNAs. However, we could exclude the case when an artifact chimeric transcript combines an mRNA with a lncRNA. De novo lncRNAs were filtered not to share a sense exonic overlap with annotated protein-coding genes (see below) and we similarly removed de novo mRNA transcripts that spanned to a GENCODE v19 annotated lncRNA (note, that some annotated lncRNAs do have a sense exonic overlap with annotated mRNAs and we took care of such cases). By this step 3.7% (5,059) of de novo mRNA transcripts were removed from the annotation, resulting in residual 131,423 transcripts.

Filtering for lncRNAs -Creating de novo lncRNA annotation in granulocytes
Filtering out transcripts: protein-coding genes and pseudogenes: To form a preliminary de novo lncRNA annotation the transcripts that passed all the common filtering steps, but had any exonic sense overlap with a protein-coding gene (GENCODE or RefSeq) or a pseudogene (GENCODE) were removed. By this filtering step 94.6% (139,080) of all lncRNA transcripts were removed from the annotation, resulting in residual 7,862 transcripts.
Filtering out extensions of protein-coding genes: While creating the de novo mRNA set we also identified 3' or 5' extensions of annotated protein-coding genes. The previous lncRNA filtering step, which excluded all transcripts overlapping the exons of annotated protein-coding genes could leave in transcripts corresponding to these extensions. To exclude this possibility, we removed transcripts that had any exonic sense overlap with the de novo annotated mRNAs. By this filtering step 6.0% (476) of lncRNA transcripts were removed from the annotation, resulting in residual 7,386 lncRNA transcripts. Removing transcripts that overlap protein-coding genes in the sense direction: Out of 7,386 lncRNA transcripts 317 overlapped annotated or de novo assembled mRNA genes in the sense direction. We removed these transcripts from the list of de novo lncRNAs to avoid confusing their expression with the expression of overlapped protein-coding gene during the expression variation analysis. After removing sense overlapping transcripts we obtained a list of 7,069 de novo lncRNA transcripts.

Combining de novo lncRNA and mRNA transcripts into genomic loci
Transcripts were initially grouped into loci by Cuffmerge. However, after the filtering and artifact removal, many transcripts were removed and the rest had to be slightly regrouped. For example, if a de novo transcript spanned both a protein-coding gene and a lncRNA gene, two separate loci would be first grouped into one by Cuffmerge. After the removal of the artifact spanning transcript, the transcripts corresponding to a protein-coding gene would form one locus, and transcripts corresponding to a lncRNA gene would form another. We redefined the locus definition to account for removal of some transcripts using a custom script. 131,423 de novo mRNA transcripts formed 10,029 genomic loci with a mean of 13.1 transcripts per locus (median -10 transcripts per locus). 7,069 de novo lncRNA transcripts formed 1,691 genomic loci with a mean of 3.9 transcripts per locus (median -1 transcript per locus).

Protein-coding potential calculation pipeline
We based our mRNA de novo annotation on filtering for transcripts exonically overlapping annotated protein-coding genes. On the other hand, we based the de novo annotation of lncRNAs filtering for transcripts that had no exonic overlap with annotated protein-coding genes. However, although we combined both RefSeq and GENCODE v19 public annotations for protein-coding genes, a possibility remained that within the lncRNA list there were unknown transcripts coding for proteins. To test the coding potential of transcripts remaining in the lncRNA list, we performed an estimation of proteincoding potential of each de novo annotated transcript. We used a combination of two previously developed tools: RNAcode [11] and Coding Potential Calculator or CPC [12]. We used a local version of CPC (cpc-0.0-r2) that was modified to work with HMMER 3.0 [13] instead of blastx using UniProtKB/Swiss-Prot database (Jan 2012) (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fas ta). cDNA of the transcript for HMMER input was created using the getfasta tool from the bedtools suite: bedtools getfasta -bed [bed12 file] -fi mm10.fa -s -split -fo [cDNA FASTA file]. Potential peptides originating from this transcript were determined using the transeq tool from EMBOSS6.5.7: transeq -frame F [cDNA FASTA file] [translated protein sequence]. The result of this program is a continuously translated protein sequence for each of the three forward frames. As this sequence contains many stop codons we reduced the runtime of HMMER by extracting those peptide sequences that are between two stop codons using a custom script. The first peptide sequence (before the first stop) as well as the last peptide sequence (after the last stop) was retained if it was longer than 20 amino acids. All other sequences were retained if they contained a start codon (M) that was located more than 30 amino acids from the next stop. HMMER was used with the following command: phmmer -E 0.0000000001 --cpu 3 -tblout [output file with alignments].
The exact criteria for calling a gene non-protein-coding were determined by analyzing a set of wellstudied lncRNAs (H19, XIST, JPX, MALAT, NEAT1, TUSC7, ANRIL, MIAT, HULC, HOTTIP, HOTAIR and HOTAIRM1 -see Additional File 2D). We controlled for false positive results by then applying the pipeline to public annotations of coding and non-coding multi-exonic transcripts. With the chosen criteria 98.9% of RefSeq and 96.1% of GENCODE multi-exonic protein-coding transcripts were identified as protein-coding. Accordingly, 94.4% of multi-exonic GENCODE lncRNA transcripts were identified as non-protein-coding. The final criteria for calling a transcript non-coding were the following: CPC score of a transcript had to be less than 1.6. RNAcode score had to be less than 18 (in case genome alignments for 3 species were available for the transcript and RNAcode score could be calculated). We discarded all the de novo lncRNA transcripts that were identified by the pipeline as having protein-coding potential. Moreover, if more than 15% of transcripts in one locus were called protein-coding, we discarded the whole locus. Thus, we obtained the final de novo lncRNA annotation consisting of 6,249 lncRNA transcripts. We lastly fine-tuned the loci definition (for loci where some isoforms were removed by the protein-coding potential filtering) and obtained the final annotation of 1,591 lncRNA loci.

Calculating exonic coverage
Exonic coverage of one ("reference") multiexonic annotation by another ("analyzed") was calculated using a custom Perl script. For each transcript of the reference annotation we looked for transcripts of the analyzed annotation which would exonically overlap it in the sense orientation. From these transcripts the one that covered the highest percentage of the exonic length of the reference transcript was picked and the exonic coverage given by this transcript was used as an output of the analysis for a given reference transcript.

Creating granulocyte specificity estimation heat maps
Each heat map was created from a table listing all the transcripts/loci for each annotation and corresponding RPKMs (calculated by RPKM_count.py) in 36 (GRA_pap, GRA_tot and 34 public RNA-seq samples) samples. Prior to building the heat map, an expression cut off was applied filtering for transcripts expressed (RPKM>0.2) in at least one sample, RPKMs were normalized to the maximum RPKM among all the samples for each transcript (row) and the maximum was set to one. The data table was then sorted using a custom script to organize the columns such that the transcripts showing >70% expression level relative to the maximum would be place on the top and then the rest of the table would be sorted the same way for the next column. This procedure was done consequently for all the columns. We then picked the transcripts/loci that fulfill "granulocyte specificity" criteria and placed them in the upper part of the table, and the rest of the transcripts/loci, to the lower part of the table. Then we used pheatmap function in R without the clustering option for rows or columns, to create the final heat maps.

RT-PCR to test splicing efficiency calculation
Additional  To test bioinformatic calculation of splicing efficiency we picked efficiently spliced (defined as mean splicing efficiency per transcript>80%, maximal splicing efficiency per transcript>90%) and inefficiently spliced (defined as mean splicing efficiency per transcript<50%, maximal splicing efficiency per transcript<70%) de novo granulocyte transcripts, eight each (4 x lncRNAs, 4 x mRNAs). We preliminary filtered lncRNA and mRNA transcripts to contain at least one junction which could be tested using our assay, i.e. which would be short enough to allow amplification of both spliced and unspliced products by standard PCR (junction length<1500bp). We additionally filtered the transcripts for relatively high expression (RPKM>1) to facilitate RT-PCR amplification. We also did not pick transcripts that were antisense to another gene since RT-PCR is not strandspecific. List of picked lncRNA and mRNA transcripts and primers to amplify the short junction are given in Figure S13B. Size of the expected PCR product amplifying the unspliced junction is given in the right-most column of the table in Figure S13B. RT-PCR program: 95 o 3min, (95 o 30sec, 59 o 30sec, 72 o 1 min) for 35 cycles, 72 o 7min. Only the 8x2 randomly picked transcripts (one junction each) were tested. Only one primer pair per junction was tested.