Workflow of the algorithm and analysis. (a) DeepCAGE data from 33 FANTOM4 RNA libraries are used to define transcription start site (TSS) clusters of overlapping tags in regions up to 50 kb upstream of annotated precursor miRNAs (pre-miRNAs). (b) TSS clusters that overlap the gene starts of other annotated Ensembl transcripts are filtered out, as are tags spanning exonic regions. (c) Sequence features, such as CpG content, conservation score and TATA box affinity, are calculated both in 1,000-bp regions around putative TSSs and in random intergenic regions and, together with read count distributions, are used to model the mixture of the two classes, promoter and non-promoter. (d) Regions whose probability of being a promoter exceeds their probability of being noise are used to analyze general characteristics of miRNA promoters, such as enrichment of specific transcription factor binding sites.
bp, base pair; kb, kilobase; pre-miRNA, precursor miRNA; TSS, transcription start site.
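
Steps (a) and (b) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the half-open coordinate convention, and the restriction to a single chromosome and strand are all assumptions made for illustration, and the caption does not specify the exact clustering or overlap criteria used in the study.

```python
from typing import List, Tuple

# Assumption: half-open [start, end) coordinates on one chromosome and strand.
Interval = Tuple[int, int]


def cluster_overlapping_tags(tags: List[Interval]) -> List[Interval]:
    """Merge overlapping tags into clusters, as a stand-in for step (a)."""
    clusters: List[Interval] = []
    for start, end in sorted(tags):
        if clusters and start < clusters[-1][1]:
            # The tag overlaps the current cluster, so extend that cluster.
            prev_start, prev_end = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end))
        else:
            clusters.append((start, end))
    return clusters


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def filter_clusters(clusters: List[Interval],
                    excluded: List[Interval]) -> List[Interval]:
    """Drop clusters that overlap any excluded region, as in step (b).

    `excluded` is a hypothetical list holding annotated gene starts and
    exonic regions.
    """
    return [c for c in clusters
            if not any(overlaps(c, e) for e in excluded)]


# Hypothetical tag coordinates, invented for this example.
example_tags = [(100, 120), (110, 130), (300, 320), (500, 520), (515, 540)]
example_excluded = [(310, 311)]  # e.g. the start of another annotated transcript

example_clusters = cluster_overlapping_tags(example_tags)
example_kept = filter_clusters(example_clusters, example_excluded)
print(example_clusters)  # [(100, 130), (300, 320), (500, 540)]
print(example_kept)      # [(100, 130), (500, 540)]
```

Sorting the tags first means each tag only has to be compared against the most recent cluster, so the merge runs in a single pass. A real pipeline would also have to handle strand, chromosome, and the 50 kb upstream window, which this sketch leaves out.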