A steganalysis-based approach to comprehensive identification and characterization of functional regulatory elements

WordSpy, a novel, steganalysis-based approach for genome-wide motif-finding is described and applied to yeast and Arabidopsis promoters, identifying cell-cycle motifs.

from each other, the ACOP or SACOP scores are relatively low. Another approach is to count the number of good pairs as follows. We randomly sampled 100 genes from the entire genome and calculated the pairwise coherence of all of their expression profiles. We then defined the fifth percentile of the distribution of these pairwise coherences as a threshold T. For a set of N genes, we counted the number of good pairs, that is, pairs of genes whose expression coherence is above the threshold T. By randomly sampling N genes from the genome and counting the number of good pairs in each sample, we calculated a Z-score, named the GP score, that measures how significant the number of good pairs in the original set is relative to the random samples. This approach can, to some extent, overcome the problem of splitting clusters.
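The good-pair counting and the resulting Z-score can be sketched as follows. This is a minimal illustration and not the WordSpy implementation: Pearson correlation is used as a stand-in for the expression coherence measure, which is an assumption, and all function names (`pearson`, `percentile`, `coherence_threshold`, `count_good_pairs`, `gp_score`) and the number of random samples are hypothetical. The percentile used for T is left as a parameter.

```python
import math
import random
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; a stand-in coherence measure (assumption)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no coherence information
    return sxy / math.sqrt(sxx * syy)


def percentile(values, pct):
    """Nearest-rank percentile of a list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def coherence_threshold(genome, pct, n_sample=100, rng=random):
    """Threshold T from the pairwise coherences of a random gene sample."""
    sample = rng.sample(genome, min(n_sample, len(genome)))
    scores = [pearson(p, q) for p, q in combinations(sample, 2)]
    return percentile(scores, pct)


def count_good_pairs(profiles, threshold):
    """Number of gene pairs whose coherence is above the threshold."""
    return sum(1 for p, q in combinations(profiles, 2)
               if pearson(p, q) > threshold)


def gp_score(gene_set, genome, threshold, n_random=200, rng=random):
    """Z-score of the good-pair count against random sets of equal size."""
    observed = count_good_pairs(gene_set, threshold)
    null = [count_good_pairs(rng.sample(genome, len(gene_set)), threshold)
            for _ in range(n_random)]
    mu = statistics.fmean(null)
    sigma = statistics.pstdev(null)
    if sigma == 0:
        return 0.0  # degenerate null distribution; no Z-score is defined
    return (observed - mu) / sigma
```

Here `genome` and `gene_set` are lists of expression profiles (lists of numbers of equal length). Because each pair is scored on its own, a gene set made of two tight but unrelated clusters still contributes many good pairs, which is the property the text relies on.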
We ran the experiments on the yeast cell-cycle genes with all three methods. The known yeast cell-cycle motifs and their G-score rankings under the different methods are listed in Supplemental Table 2. The results suggest that the three methods are comparable. Since the ACOP score is the easiest to compute, we adopted it in our WordSpy implementation.
Deciphering an English stegoscript. We applied WordSpy to a stegoscript (∼268K letters) in which the first ten chapters (∼112K letters) of the novel Moby Dick were embedded. Supplemental Fig. 1 shows a small portion of the stegoscript, where the underlined text is the title and first two sentences of Chapter One. We ran WordSpy with different Z-score thresholds and searched for words with a maximum length of 15. We measured performance by the true positive rate (TPR), the percentage of words in the original text that were discovered, and the false prediction rate (FPR), the percentage of words in the deciphered text that were false predictions. In measuring these rates, we considered different degrees of match between a word in the original text and a predicted word in the deciphered text. We defined the match rate of a word as the percentage of its characters that match the original word. We considered a prediction correct if its word match rate reaches a given threshold. As summarized in Supplemental Table 1, TPR decreases and FPR increases as the word match rate threshold increases. Under the most stringent criterion of a 100% word match, WordSpy recovers ∼70% of the original words exactly, with a false prediction rate of ∼19%, using Z-score threshold 6. By comparison, when the word match rate threshold is 50%, TPR increases to ∼82% while FPR decreases to ∼4.6%.
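The two rates can be made concrete with a small sketch. It assumes, purely for illustration, that original and predicted words are given as half-open `(start, end)` character spans in the stegoscript, and that the match rate of a pair is their overlap divided by the longer of the two words, so that a 100% match means identical spans. The exact matching rule used in the paper may differ, and the function names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two half-open (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_rate(true_span, pred_span):
    """Overlap divided by the longer word (an assumed definition)."""
    longest = max(true_span[1] - true_span[0], pred_span[1] - pred_span[0])
    return overlap(true_span, pred_span) / longest


def tpr_fpr(true_spans, pred_spans, threshold):
    """Return (TPR, FPR) at a given word match rate threshold in [0, 1]."""
    # A true word is discovered if some prediction matches it well enough.
    found = sum(
        1 for t in true_spans
        if any(match_rate(t, p) >= threshold for p in pred_spans)
    )
    # A prediction is false if it matches no true word well enough.
    false = sum(
        1 for p in pred_spans
        if not any(match_rate(t, p) >= threshold for t in true_spans)
    )
    tpr = found / len(true_spans) if true_spans else 0.0
    fpr = false / len(pred_spans) if pred_spans else 0.0
    return tpr, fpr
```

For example, with original words at `[(0, 4), (4, 6), (10, 15)]` and predictions at `[(0, 4), (4, 7), (20, 23)]`, only the first word is found under a 100% threshold, while lowering the threshold to 50% also accepts the second. The quadratic scan is kept for clarity; on a script of the size used here, an index over sorted spans would be preferable.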
A close examination showed that the FPR initially decreases and then stays relatively constant as the Z-score threshold increases (Supplemental Fig. 2(a)). When the Z-score threshold is high enough (>5.5), most falsely predicted words are filtered out. On the other hand, the TPR always decreases as the Z-score threshold increases. The best overall performance seems to be reached around a Z-score threshold of 6 (Supplemental Fig. 2(b)).
To further analyze WordSpy's performance, we tested it on stegoscripts with uniformly random covertext of different sizes. Using the first ten chapters of Moby Dick, we generated six scripts with ratios of covertext to secret message ranging from 2 to 7. We expected the deciphering problem to become harder as the amount of covertext increased. Supplemental Fig. 3 shows the results with Z-score threshold 3 on all six scripts. As expected, the true positive rate decreased and the false prediction rate increased as the amount of covertext increased. However, even when the covertext was 7 times as large as the original novel, WordSpy performed reasonably well; it accurately predicted 75% of the original words, with only 53% of its predictions being false positives.
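A test script of this kind can be generated along the following lines. This is a sketch under stated assumptions rather than the procedure actually used: the secret message is taken to be a list of lowercase words, covertext letters are drawn uniformly from the 26-letter alphabet, and the cover letters are scattered at random over the gaps between words. The function name `make_stegoscript` is hypothetical.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumed covertext alphabet


def random_letters(n, rng):
    """A run of n letters drawn uniformly from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def make_stegoscript(words, ratio, rng):
    """Embed words in uniform random covertext.

    Returns the script and the half-open (start, end) span of each word.
    The covertext length is round(ratio * total message length).
    """
    msg_len = sum(len(w) for w in words)
    n_cover = round(ratio * msg_len)

    # Scatter the cover letters over the gaps before, between and after words.
    gaps = [0] * (len(words) + 1)
    for _ in range(n_cover):
        gaps[rng.randrange(len(gaps))] += 1

    pieces, spans, pos = [], [], 0
    for gap, word in zip(gaps, words):
        pieces.append(random_letters(gap, rng))
        pos += gap
        spans.append((pos, pos + len(word)))
        pieces.append(word)
        pos += len(word)
    pieces.append(random_letters(gaps[-1], rng))  # trailing covertext
    return "".join(pieces), spans
```

Calling `make_stegoscript(words, ratio, random.Random(seed))` for `ratio` in 2 through 7 yields six scripts comparable in spirit to those described above, and the returned spans give the ground truth needed to score a deciphering run.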
To complete our example in Supplemental Fig. 1(a)

(b) The ratios of true prediction rate over false prediction rate at different Z-score thresholds. The results are listed for different word match ratios. A 100% match means an exact match. A 50% match means that at least half of a word must be covered by the deciphered text for the word to be considered predicted.

1: Results on a stegoscript containing the first ten chapters of the novel Moby Dick for Z-score threshold 6. The original text contains 18930 words in total. The deciphered text contains 16522 discovered words in total. Word match ratio is the minimum percentage of position matches for a true word to be considered correctly predicted. True words discovered is the number of true words correctly predicted. True positive rate is the percentage of true words discovered over the total words in the original text. False words reported is the number of words falsely predicted under the different word match ratios. False prediction rate is the percentage of false words reported over the total words in the deciphered text.

table 5 and 6 into 55 clusters based on sequence similarity. In each cluster, motifs overlap by at least 7 nucleotides. For each cluster, we plot the motif logo based on the motifs' target sites in all the cell-cycle genes. The whole table is in the Excel file WangSuppTable7.xls.