Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data

A set of methods is presented for normalization, quantification of noise and co-expression analysis for gene expression studies using deep sequencing.


Replicate scatter for Solexa RNA-seq data
For the same two RNA-seq samples, figure S2 shows a scatter plot of the number of reads per position in the two samples.

Per `exon' replicate scatter for Solexa RNA-seq data
For the same data-set shown in figure S2, we used single-linkage clustering to cluster overlapping reads into `exons'. Figure S3 shows a scatter plot analogous to figure S2, but now for the expression of these `exons' across the two replicates.
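The grouping of overlapping reads into `exons' can be sketched as follows. This is a minimal illustration, not the code used for figure S3: the function name `cluster_overlapping_reads`, the half-open interval convention, and the example coordinates are all assumptions, and the reads are assumed to lie on a single chromosome and strand.

```python
from typing import List, Tuple

# Assumed convention: a read is a half-open interval [start, end).
Interval = Tuple[int, int]


def cluster_overlapping_reads(reads: List[Interval]) -> List[Tuple[int, int, int]]:
    """Single-linkage clustering by overlap.

    Returns one (start, end, read_count) tuple per cluster. Two reads end up
    in the same cluster if they are connected by a chain of overlapping reads.
    """
    clusters: List[Tuple[int, int, int]] = []
    for start, end in sorted(reads):
        if clusters and start < clusters[-1][1]:
            # The read overlaps the current cluster: extend it and count the read.
            c_start, c_end, n = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), n + 1)
        else:
            # No overlap with the current cluster: open a new one.
            clusters.append((start, end, 1))
    return clusters


# Hypothetical reads of length 36 at made-up positions.
example_reads = [(100, 136), (120, 156), (150, 186), (400, 436), (430, 466)]
exons = cluster_overlapping_reads(example_reads)
```

Because the reads are processed in order of start position and each cluster tracks its rightmost end, one pass is enough to recover the transitive closure of the overlap relation. The read count per cluster is the quantity that would be compared between replicates.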

CAGE per TSS replicate scatter
Two independent CAGE samples were obtained from a common RNA sample from THP-1 cells after 8 hours of treatment with LPS. Figure S4 shows a scatter-plot of the normalized tags-per-million of each TSS for these two replicate samples.

CAGE per gene replicate scatter
For the same two replicate samples shown in figure S4 we summed, for each gene, the expression from all TSSs associated with the gene, to obtain a normalized expression per gene. Figure S5 shows a scatter-plot of the per gene expression of the CAGE replicates.
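The per-gene summation can be sketched as below. The identifiers, the mapping `tss_to_gene`, and the example values are hypothetical; the input is assumed to be already normalized to tags-per-million, and TSSs without an associated gene are simply skipped.

```python
from collections import defaultdict
from typing import Dict


def per_gene_expression(
    tss_tpm: Dict[str, float], tss_to_gene: Dict[str, str]
) -> Dict[str, float]:
    """Sum normalized expression over all TSSs associated with each gene."""
    gene_tpm: Dict[str, float] = defaultdict(float)
    for tss, tpm in tss_tpm.items():
        gene = tss_to_gene.get(tss)
        if gene is not None:  # TSSs with no associated gene are ignored
            gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Made-up normalized expression values (tags-per-million) for four TSSs.
tss_tpm = {"tss1": 12.5, "tss2": 3.5, "tss3": 40.0, "tss4": 1.0}
tss_to_gene = {"tss1": "GENE_A", "tss2": "GENE_A", "tss3": "GENE_B"}
gene_tpm = per_gene_expression(tss_tpm, tss_to_gene)
```

Applying the same function to each replicate gives the two per-gene vectors that a scatter plot such as figure S5 compares.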

Comparison with FANTOM3 clustering
For human our data contained a total of CAGE tags representing unique TSS locations in the human genome. First, a significantly larger number of unique TSSs is included in the FANTOM3 clustering. This is because TSSs whose expression profiles differ significantly from those in a TSC (often lowly expressed TSSs) are clustered with that TSC in the FANTOM3 clustering, whereas in our clustering they form separate TSCs, which are then filtered out owing to their low expression. The total number of TSCs in the FANTOM3 clustering is lower because neighboring TSCs with different expression profiles are all clustered together. Even though the FANTOM3 clustering has fewer TSCs, its final number of TSRs is a little larger: because it tends to cluster all nearby TSSs irrespective of their expression profile, a large number of lowly expressed TSRs pass the cut-off on minimal expression in the filtering stage. Figure S6 compares the distributions of the number of TSSs per TSC, the number of TSCs per TSR, and the number of TSSs per TSR, for our clustering and for the single-linkage clustering that was employed in FANTOM3.
As illustrated by the left and right panels of figure S6, there are in general more TSSs per TSC and more TSSs per TSR for the FANTOM3 clustering. In contrast, there tend to be more TSCs per TSR for our clustering. Both observations result from the fact that in our clustering TSSs with different expression profiles are not clustered together, even if they are near each other, whereas the single-linkage clustering fuses all these TSSs into a single TSC. Figure S7 shows the distributions of the lengths of TSCs and TSRs for both our clustering and the FANTOM3 clustering. Although on the logarithmic scales the length distributions appear quite similar for the two clustering procedures, the TSCs obtained by the FANTOM3 clustering tend to be significantly wider. More strikingly, for the FANTOM3 clustering there is a pronounced shoulder in the distributions at a width equal to the cut-off distance of the single-linkage clustering, which is almost certainly an artifact of that cut-off.
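To make the chaining behavior of single-linkage clustering with a distance cut-off concrete, a minimal sketch follows. The function name, the TSS positions, and in particular the cut-off value `max_gap=20` are illustrative assumptions and are not taken from the FANTOM3 procedure.

```python
from typing import List


def single_linkage_by_distance(positions: List[int], max_gap: int) -> List[List[int]]:
    """Cluster positions so that consecutive members are at most max_gap apart.

    Expression profiles are deliberately ignored: proximity alone decides
    cluster membership.
    """
    clusters: List[List[int]] = []
    for pos in sorted(set(positions)):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close enough to the previous TSS: chain on
        else:
            clusters.append([pos])    # gap exceeds the cut-off: start a new cluster
    return clusters


# Made-up TSS positions on one strand; max_gap is an arbitrary example value.
tss_positions = [1000, 1005, 1018, 1036, 1100, 1104]
clusters = single_linkage_by_distance(tss_positions, max_gap=20)
widths = [c[-1] - c[0] + 1 for c in clusters]
```

In this example the TSSs at 1000 and 1036 fall into the same cluster although they are 36 base pairs apart, because intermediate TSSs bridge the gap. This is the mechanism by which proximity-only clustering produces wider clusters than a procedure that also requires similar expression profiles.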

Nearby uncorrelated TSSs
In figure 12 of the main article we showed an example of neighboring TSCs that have significantly different expression profiles, which were shown in panel C. To further illustrate that these expression profiles are indeed not correlated, figure S8 shows a scatter plot of the expression of the two TSCs across the CAGE samples. The plot confirms that there is no discernible correlation between the expression profiles of the two TSCs, and they are certainly not tightly co-regulated, which supports the view that these two TSCs are driven by distinct sets of regulatory sites.
In figure S9 we show another example of a set of nearby TSCs with clearly distinct expression profiles. The interesting feature of this example is that there are two broad TSCs, each containing a substantial number of TSSs that all show correlated expression, which are interrupted by a single TSS that shows a very different expression profile (the red TSS). The structure of this promoter region suggests that, on the one hand, there is a broad region to which the polymerase is recruited by one set of regulatory mechanisms, while on the other hand there is a single TSS within the same region to which the polymerase is recruited by a distinct regulatory mechanism.
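One simple way to quantify the lack of correlation between two expression profiles is a Pearson correlation coefficient on log-transformed expression. The sketch below rests on assumptions: the text does not state which correlation measure, if any, was computed, and the pseudocount of 0.5 and the example values are hypothetical.

```python
import math
from typing import List


def log_profile(tpm: List[float], pseudocount: float = 0.5) -> List[float]:
    """Log-transform an expression profile; the pseudocount (an assumption)
    avoids taking the logarithm of zero."""
    return [math.log(v + pseudocount) for v in tpm]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Made-up tags-per-million values for two neighboring TSCs across four samples.
tsc_a_tpm = [1.0, 2.0, 4.0, 8.0]
tsc_b_tpm = [5.0, 5.5, 4.5, 5.0]
r = pearson(log_profile(tsc_a_tpm), log_profile(tsc_b_tpm))
```

A coefficient near zero is consistent with, but does not prove, independent regulation. With only a handful of samples the coefficient is noisy, so the scatter plot itself remains the more informative check.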

Mouse Promoterome Statistics
For the mouse promoterome, as for the human promoterome, we first calculated the distribution of phastCons conservation scores as a function of position relative to the most expressed TSS in each TSC. Figure S10 shows the phastCons conservation profiles that we obtained for all TSCs (left panel) and for the novel TSCs (right panel). The conservation profiles for mouse are very similar to the ones that we observed for human. We again see a sharp peak of conservation covering a few hundred base pairs around the TSS. The novel promoters show a conservation peak of similar width but lower height. Interestingly, whereas for human the conservation peak of the novel promoters was close to symmetric, for mouse the novel promoter peak is also clearly asymmetric, although still not as asymmetric as the peak for the known TSSs.
Next we determined the position of the closest start of a known transcript for each mouse TSC. Figure S11 shows the distribution of the relative positions of the closest known starts for all mouse TSCs that have a known start within 1000 base pairs of the TSC. Negative numbers mean the nearest known start is upstream of the TSC. The vertical axis is shown on a logarithmic scale. The figure shows only the 45,603 TSCs (59%) with a known start within 1000 base pairs.
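The relative position of the closest known start can be computed as sketched below. Several details are assumptions made for illustration: each TSC is represented by a single coordinate, the known starts are pre-sorted and restricted to the same chromosome and strand as the TSC, and the function name and example coordinates are hypothetical. The sign convention follows the figure: negative offsets mean that the known start lies upstream of the TSC.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_known_start_offset(
    tsc_pos: int,
    strand: str,
    known_starts_sorted: List[int],
    max_dist: int = 1000,
) -> Optional[int]:
    """Signed distance from a TSC to the nearest known transcript start.

    Returns None if there is no known start within max_dist base pairs.
    """
    if not known_starts_sorted:
        return None
    i = bisect_left(known_starts_sorted, tsc_pos)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = known_starts_sorted[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda s: abs(s - tsc_pos))
    offset = nearest - tsc_pos
    if strand == "-":
        offset = -offset  # on the minus strand, upstream means larger coordinates
    return offset if abs(offset) <= max_dist else None


# Made-up known transcript starts on one chromosome and strand.
known = [500, 2000, 5000]
offsets = [
    nearest_known_start_offset(1900, "+", known),  # known start 100 bp downstream
    nearest_known_start_offset(2050, "+", known),  # known start 50 bp upstream
    nearest_known_start_offset(3500, "+", known),  # nothing within 1000 bp
]
```

Collecting the non-`None` offsets over all TSCs and histogramming them on a logarithmic vertical axis would give a plot of the kind shown in figure S11.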
The distribution in figure S11 is also very similar to what we observed for the human promoterome. The main difference is that whereas for human 62.2% of all TSCs have a known start within 1000 base pairs, for mouse this is only 59%, which is likely due to the larger amount of data available for human.
Figure S12 shows the hierarchical structure of the mouse promoterome that we constructed. In particular, we show the distribution of the number of TSSs per TSC, the number of TSCs per TSR, and the number of TSSs per TSR, as we also showed for the human promoterome in the main article. The distributions in figure S12 are generally very similar to those observed for the human promoterome. The distributions are all a little narrower than for human, which is likely the result of the larger amount of data available for human. Importantly, as in the human data, the distribution of the number of TSSs per TSR also shows the clear `shoulder' corresponding to TSRs with between roughly 10 and 50 TSSs.
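The three distributions that summarize the hierarchical structure can be tabulated as in the following sketch. The dictionary layout (TSR identifier to list of TSC identifiers, TSC identifier to list of TSS positions), the function name, and the example entries are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def hierarchy_distributions(
    tsr_to_tscs: Dict[str, List[str]],
    tsc_to_tsss: Dict[str, List[int]],
) -> Tuple[Counter, Counter, Counter]:
    """Return histograms of TSSs per TSC, TSCs per TSR, and TSSs per TSR.

    Each Counter maps a size (e.g. number of TSSs) to the number of
    clusters or regions of that size.
    """
    tss_per_tsc = Counter(len(tsss) for tsss in tsc_to_tsss.values())
    tsc_per_tsr = Counter(len(tscs) for tscs in tsr_to_tscs.values())
    tss_per_tsr = Counter(
        sum(len(tsc_to_tsss[tsc]) for tsc in tscs)
        for tscs in tsr_to_tscs.values()
    )
    return tss_per_tsc, tsc_per_tsr, tss_per_tsr


# Made-up hierarchy: two TSRs containing three TSCs and six TSSs in total.
tsc_to_tsss = {"tsc1": [100, 101, 105], "tsc2": [130], "tsc3": [900, 902]}
tsr_to_tscs = {"tsr1": ["tsc1", "tsc2"], "tsr2": ["tsc3"]}
tss_per_tsc, tsc_per_tsr, tss_per_tsr = hierarchy_distributions(
    tsr_to_tscs, tsc_to_tsss
)
```

Plotting each histogram on logarithmic axes is what would make a feature such as the `shoulder' in the number of TSSs per TSR visible.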
Finally, we also calculated the length distributions of mouse TSCs and TSRs, both using our clustering procedure and using the single-linkage clustering employed in FANTOM3 (figure S13). Here too the distributions are very similar to the results that we obtained for the human data. In particular, we clearly see the shoulder in the distribution of TSR lengths for lengths roughly between and 150 base pairs. We also again see that the single-linkage clustering leads to wider clusters and to an artificial shoulder at the distance cut-off (i.e. the length of the CAGE tags, which was chosen as the cut-off).