Comprehensive identification and characterization of conserved small ORFs in animals

There is increasing evidence that non-annotated short open reading frames (sORFs) can encode functional micropeptides, but computational identification remains challenging. We expand our published method and predict conserved sORFs in human, mouse, zebrafish, fruit fly and the nematode C. elegans. Isolating specific conservation signatures indicative of purifying selection on encoded amino acid sequence, we identify about 2000 novel sORFs in the untranslated regions of canonical mRNAs or on transcripts annotated as non-coding. Predicted sORFs show stronger conservation signatures than those identified in previous studies and are sometimes conserved over large evolutionary distances. Encoded peptides have little homology to known proteins and are enriched in disordered regions and short interaction motifs. Published ribosome profiling data indicate translation for more than 100 of novel sORFs, and mass spectrometry data gives peptidomic evidence for more than 70 novel candidates. We thus provide a catalog of conserved micropeptides for functional validation in vivo.

Ongoing efforts to comprehensively annotate the genomes of humans and other species 2 revealed that a much larger fraction of the genome is transcribed than initially 3 appreciated 1 . Pervasive transcription produces a number of novel classes of non--coding 4 RNAs, in particular long intergenic non--coding RNAs (lincRNAs) 2 . The defining feature of 5 lincRNAs is the lack of canonical open reading frames (ORFs), classified mainly by 6 length, nucleotide sequence statistics, conservation signatures and similarity to known 7 protein domains 2 . Although coding--independent RNA--level functions have been 8 established for a growing number of lincRNAs 3,4 , there is little consensus about their 9 general roles 5 . Moreover, the distinction between lincRNAs and mRNAs is not always 10 clear--cut 6 , since many lincRNAs have short ORFs, which easily occur by chance in any 11 stretch of nucleotide sequence. However, recent observations suggest that lincRNAs and 12 other non--coding regions are often associated with ribosomes and sometimes in fact 13 translated 7--16 . Indeed, some of the encoded peptides have been detected via mass 14 spectrometry 10,17--23 . Small peptides have been marked as essential cellular components 15 in bacteria 24 and yeast 25 . More detailed functional studies have identified the well-- 16 known tarsal--less peptides in insects 26--29 , characterized a short secreted peptide as an 17 important developmental signal in vertebrates 30 , and established a fundamental link 18 between different animal micropeptides and cellular calcium uptake 31,32 . 19 Importantly, some ambiguity between coding and non--coding regions has been 20 observed even on canonical mRNAs 15 : upstream ORFs (uORFs) in 5' untranslated 21 regions (5'UTRs) are frequent, well--known and mostly linked to the translational 22 regulation of the main CDS 33, 34 . To a lesser extent, mRNA 3'UTRs have also been found 23 associated to ribosomes, which has been attributed to stop--codon read--through 35 , in 24 other cases to delayed drop--off, translational regulation or ribosome recycling 36 , and 25 even to the translation of 3'UTR ORFs (dORFs) 10 . Translational regulation could be the 26 main role of these ORFs, and regulatory effects of translation (e.g., mRNA decay) could 27 be a major function of lincRNA translation 12 . Alternatively, they could be ORFs in their 28 own right, considering well--known examples of polycistronic transcripts in animals such 29 as the tarsal--less mRNA 26--28 . Indeed, many non--annotated ORFs have been found to 30 produce detectable peptides 10, 17 , and might therefore encode functional 31 micropeptides 37 . 32 Typically, lincRNAs are poorly conserved on the nucleotide level, and it is hard to 33 computationally detect functional conservation despite sequence divergence even when 34 it is suggested by synteny 2,38 . In contrast, many of the sORFs known to produce 35 4 functional micropeptides display striking sequence conservation, highlighted by a 1 characteristic depletion of nonsynonymous compared to synonymous mutations. This 2 suggests purifying selection on the level of encoded peptide (rather than DNA or RNA) 3 sequence. Also, the sequence conservation rarely extends far beyond the ORF itself, and 4 an absence of insertions or deletions implies conservation of the reading frame. These 5 features are well--known characteristics of canonical protein--coding genes and have in 6 fact been used for many years in comparative genomics 39,40 . While many powerful 7 computational methods to identify protein--coding regions are based on sequence 8 statistics and suffer high false--positive rates for very short ORFs 41,42 , comparative 9 genomics methods have gained statistical power over the last years given the vastly 10 increased number of sequenced animal genomes. 11 Here, we present results of an integrated computational pipeline to identify conserved 12 sORFs using comparative genomics. We greatly extended our previously published 13 approach 10 and applied it to the entire transcriptome of five animal species: human (H.  17 annotated as non--coding. By means of comparative and population genomics, we detect 18 purifying selection on the encoded peptide sequence, suggesting that the detected 19 sORFs, of which some are conserved over wide evolutionary distances, give rise to 20 functional micropeptides. We compare our results to published catalogs of peptides 21 from non--annotated regions, to sets of sORFs found to be translated using ribosome 22 profiling, and to a number of computational sORF predictions. While there is often little 23 overlap, we find in all cases consistently stronger conservation for our candidates, 24 confirming the high stringency of our approach. Overall, predicted peptides have little 25 homology to known proteins and are rich in disordered regions and peptide binding 26 motifs which could mediate protein--protein interactions. Finally, we use published high-- 27 throughput datasets to analyze expression of their host transcripts, confirm translation 28 of more than 100 novel sORFs using published ribosome profiling data, and mine in-- 29 house and published mass spectrometry datasets to support protein expression from 30 more than 70 novel sORFs. Altogether, we provide a comprehensive catalog of 31 conserved sORFs in animals to aid functional studies. 32 Results 1 Identification of conserved coding sORFs from multiple species alignments. 2 Our approach, which is summarized in Fig. 1A, is a significant extension of our 3 previously published method 10 . Like most other computational studies, we take an 4 annotated transcriptome together with published lincRNA catalogs as a starting point. 5 We chose the Ensembl annotation (v74), which is currently one of the most 6 comprehensive ones, especially for the species considered here. In contrast to de novo 7 genome--wide predictions 17,43 , we rely on annotated transcript structures including 8 splice sites. We then identified canonical ORFs for each transcript, using the most 9 upstream AUG for each stop codon; although use of non--canonical start codons has been 10 frequently described 15--17,44,45 , there is currently no clear consensus how alternative 11 translation start sites are selected. Next, ORFs were classified according to their location 12 on lincRNAs or on transcripts from protein--coding loci: annotated ORFs serving as 13 positive control; ORFs in 3'UTRs, 5'UTRs or overlapping with the annotated CDS; or on 14 other transcripts from a protein--coding locus lacking the annotated CDS. We ignored 15 pseudogene loci: although pseudogenes have been associated with a variety of biological 16 functions 46--48 , their evolutionary history makes it unlikely that they harbor sORFs as 17 independent functional units encoding micropeptides. 18 Based on whole--genome multiple species alignments, we performed a conservation 19 analysis to obtain four characteristic features for each ORF: most importantly, we scored 20 the depletion of nonsynonymous mutations in the alignment using phyloCSF 49 ; we also 21 evaluated conservation of the reading frame from the number of species in the (un--22 stitched) alignment that lack frameshifting indels; finally, we analyzed the characteristic 23 steps in nucleotide--level conservation (using phastCons) around the start and stop 24 codons by comparing to the mean profile observed in annotated ORFs. Next, we trained 25 a classifier based on support vector machines (see also Crappe et al. 50 for a related 26 approach) on confident sets of conserved small peptides and control sORFs from non-- 27 coding regions: as positive control, we chose conserved small peptides of at most 100 aa 28 from Swiss--Prot with positive phyloCSF score. We discarded a number of presumably 29 fast--evolving peptides: 177 in human and 77 in mouse, which are associated with 30 antimicrobial defense, and 15 in fly of which 11 are signal peptides. As negative control, 31 we chose sORFs on classical ncRNAs such as pre--miRNAs, rRNAs, tRNAs, snRNAs, or 32 snoRNAs. Importantly, both of these sets overlap with a sizable number of genomic 33 regions that are highly conserved on the nucleotide level (phastCons conserved 34 elements; Fig. S1A). While each of the four conservation features performs well in 35 6 discriminating positive and negative set (Fig. S1B), their combination in the SVM 1 reaches very high sensitivity (between 1--5% false negative rate) and specificity (0.1--2 0.5% false positive rate) when cross--validating our training data ( Fig. 1B and Fig. S1B). 3 The classifier is dominated by the phyloCSF score (Fig. S1B), but the additional 4 conservation features help to reject sORFs on annotated pseudogene transcripts, which 5 typically do not show characteristic steps in nucleotide conservation near start or stop 6 codons ( Fig. 1B inset). 7 We noted that known small proteins typically reside in distinct genomic loci, while many 8 predicted ORFs on different transcript isoforms overlap with one another or with 9 annotated coding exons. Therefore, we aimed to remove candidates where the 10 conservation signal could not be unambiguously assigned. We thus implemented a 11 conservative overlap filter by excluding ORFs overlapping with conserved coding exons 12 or with longer SVM--predicted ORFs (Methods). Most sORFs in 3'UTRs or 5'UTRs pass 13 this filter, but many sORFs from different mRNA and lincRNA isoforms are collapsed, 14 and most sORFs (85--99%) overlapping with annotated coding sequence are rejected 15 ( Fig. 1C and Fig. S1D). 16 Hundreds of novel conserved sORFs, typically much smaller than known and myoregulin 31 in human. We can confirm that many transcripts annotated as 27 lincRNAs in fact code for proteins. However, it is a relatively small fraction (1--7%) that 28 includes transcripts in intermediate categories, such as TUCPs in human 53 and RITs in C. 29 elegans 54 . Further, we note that a sizable number of uORFs are predicted to encode 30 functional peptides, including the known case of MKKS 55 . Finally, we observe that the 31 great majority of predicted sORFs is much smaller (median length 11 aa for 3'UTR 32 sORFs in C. elegans to 49 aa for lincRNA sORFs in D. rerio) than annotated sORFs 33 (median length 81--83 aa), with sORFs in 3'UTRs and 5'UTRs typically being among the 34 shortest. 35 7 We assembled relevant information for the identified sORFs including coordinates, 1 sequences, transcript models, and features analyzed in the following sections in 2   Supplementary Tables 1--5.   3 Novel sORFs are under purifying selection on the amino acid level 4 Since selection on the level of the encoded amino acid sequence permits synonymous 5 sequence variation, we compared length--adjusted phyloCSF scores of predicted sORFs 6 to those of control ORFs matched for their nucleotide--level conservation ( Fig. 2A;   7 methods). As expected from the design of our pipeline, we find that novel predicted 8 sORFs are specifically depleted of nonsynonymous mutations, and in most cases to a 9 similar extent as annotated ones. We also collected polymorphism data to perform a 10 similar but independent test on a population genomics level: aggregating SNPs from all 11 predicted sORFs (novel or annotated), we measured the dN/dS ratio and found that 12 nonsynonymous SNPs are suppressed compared to synonymous ones to a greater extent 13 than in control regions ( Fig. 2B; Methods). This depletion is less pronounced than for 14 annotated small proteins, and the associated p--values are lower in the species with 15 higher SNP density (mouse and fruit fly with 16 and 45 SNPs/kb in the control regions) 16 than in zebrafish or C. elegans with 1.7 and 2.0 SNPs/kb, respectively. It fails to pass the 17 significance threshold in human with 2.4 SNPs/kb, where we get p=0.076 as the larger 18 value from reciprocal X 2 tests. 19 These results confirm that predicted sORFs permit synonymous more than 20 nonsynonymous sequence variation when comparing within or between species, 21 suggesting that selection acts on the level of the encoded peptide sequence and 22 therefore implying functional peptide products. 23 Some novel sORFs are widely conserved 24 We next sought to evaluate how widely the predicted sORFs are conserved. First, we 25 took an alignment--based approach: we inferred most recent common ancestors from the 26 alignment by tallying the species with conserved start and stop codons and (if 27 applicable) splice sites, and without nonsense mutations. This analysis is dependent on 28 the accuracy of the alignment, but it does not require transcript annotation in the 29 aligned species. Using this method (see Fig. 2C) we find that after annotated small 30 proteins, uORFs are most widely conserved, followed by the other sORF types. Of the 31 novel sORFs found in human, 342 are conserved in placental mammals and 39 in the 32 gnathostome ancestor (i.e., in jawed vertebrates). 18 are found conserved in teleosts, 49 33 in Drosophilids, and 88 in worms of the Elegans group. 34 8 We also addressed this question with a complementary analysis: we performed a 1 homology clustering of sORFs predicted in the different species using a BLAST--based 2 approach adapted for short amino acid sequences (Methods). This analysis clusters 3 1445 of in total 3986 sORFs into 413 homology groups, and 304 of 2002 novel 4 predictions are grouped into 138 clusters. The clusters containing at least one novel 5 predictions and sORFs from more than one species are summarily shown in Fig. 2D. 6 Some novel predictions cluster together with sORFs annotated in other species, 7 confirming the reliability of our approach and extending current transcriptome 8 annotations. For instance, several zebrafish lincRNAs are found to encode known small 9 proteins such as cortexin 2, nuclear protein transcriptional regular 1 (NUPR1), small 10 VCP/p97--interacting protein (SVIP), or centromere protein W. Conversely, some 11 lincRNAs from mouse and human encode small peptides with annotated (yet often 12 uncharacterized) homologs in other species. Further, a sORF in the 3'UTR of murine 13 Zkscan1 encodes a homolog of Sec61 gamma subunit in human, mouse, fish and fly. Also, 14 a sORF in the 5'UTR of the worm gene mnat--1 encodes a peptide with homology to 15 murine lyrm4 and the fly gene bcn92. 16 We also find 109 clusters of entirely novel predictions, such as 29 sORFs in 5'UTRs and invertebrates. Two clusters of unclear significance, consisting mainly of sORFs in the 28 3'UTRs of zinc--finger proteins, share a common HTGEK peptide motif, a known 29 conserved linker sequence in C2H2 zinc fingers 56 . Finally, we note that our sequence--30 based approaches cannot resolve structural and/or functional homologies that persist 31 despite substantial sequence divergence as observed between different animal peptides 32 interacting with the Ca 2+ ATPase SERCA 31,32 , or between bacterial homologs of the E. coli 33 CydX protein 57 . We expect that further homologies between the predicted sORFs could 34 be uncovered using more specialized approaches. 35 Taken together, this conservation analysis shows that novel sORFs are often widely 1 conserved on the sequence level; further functional homologies could exist that are not 2 detectable by sequence 31,32 . 3 Conserved sORFs are predicted with high stringency 4 Many recent studies have addressed the challenge of identifying novel small protein--5 coding genes by means of computational methods or high--throughput experiments. 6 These studies were performed in different species with different genome annotations, 7 searching in different genomic regions, allowing different length ranges and using often 8 quite different underlying hypotheses, for instance with respect to non--canonical start 9 codons. Accordingly, they arrive at very different numbers. To reconcile these different 10 approaches, we inclusively mapped sORFs defined in 15 other studies with published 11 lists of coordinates, sequences or peptide fragments, to the comprehensive set of 12 transcriptomic ORFs analyzed here (Supplementary Table 6 grouped studies by methodology, and by organism and genomic regions analyzed. We 17 then compared sORFs predicted in our study but not in others to sORFs that were 18 predicted elsewhere and analyzed but rejected here (Fig. 3). We used our results before 19 applying the overlap filter. Considering changes in annotation (e.g., of coding sequences, 20 lincRNAs and pseudogenes), we only compared to those sORFs that we analyzed and 21 classified into the corresponding category. Generally, we find rather limited overlap 22 between our predictions and results from other studies, which is partially explained by 23 differences in applied technique and underlying hypothesis. We also find that the sORFs 24 that we predict for the first time have consistently much higher length--adjusted 25 phyloCSF scores than those found in other studies but rejected in ours; in many cases, 26 we also find that the dN/dS ratio of nonsynonymous vs. synonymous SNP density is 27 lower, albeit in a similar number of cases there is not enough data to render the p--value 28 significant (we used the larger one from reciprocal X 2 --tests). 29 First, we compared to a study using ribosome profiling in zebrafish 30 , with similar 30 overlap as reported in our previous publication 10 , the results of which are re--analyzed 31 with the updated transcriptome annotation for comparison ( Fig. 3A--B). Ribosome 32 profiling provides evidence of translation in the cell types or developmental stages 33 analyzed, but in addition to coding sORFs it also detects sORFs with mainly regulatory 34 functions such as uORFs. Next, we compared to 7 studies employing mass 35 spectrometry 17--22 : matching given protein sequences or re--mapping detected peptides to 1 the set of sORFs analyzed here, we find only between 1 and 12 common results from 2 between 3 and almost 2000 sORFs (Fig. 3C--E). Note that up to 62% of peptides 3 identified in these studies come from pseudogene loci which we excluded. While mass 4 spectrometry provides direct evidence for peptide products, it is also performed in 5 specific cell lines or tissues and has limited dynamic range. This can prevent detection of 6 small peptides, which might be of low abundance or half--life, or get lost during sample 7 preparation. Both experimental methods cannot distinguish sORFs coding for conserved 8 micropeptides from those coding for lineage--specific or fast--evolving functional 9 products. It is thus not surprising that these sORFs are as a group less conserved than 10 the ones found using conservation as a selection criterion. 11 Next, we compared our results against other computational studies 18,43,50,58--60 . Here, we 12 can often match much larger number of sORFs, but except for predictions of the CRITICA 13 pipeline in mouse cDNAs 58 , we again find only limited overlap: we predict between 0 14 and 23% of analyzed sORFs found elsewhere, indicating a high variability in different 15 computational methods, even though many of them use evolutionary conservation as a 16 filter. The consistently better conservation indicators for our results (Fig. 3F--N) confirm 17 that the deeper alignments and sensitive conservation features used here lead to 18 increased performance. However, we remark that our method is not designed to find 19 sORFs in alternative reading frames 61,62 unless their evolutionary signal strongly 20 exceeds what comes from the main CDS (e.g., because it is incorrectly annotated); also, 21 the limited overlap with Ruiz--Orera et al. 60 is not unexpected since their focus was on 22 newly evolved lincRNA sORFs, which are by definition not well conserved. Finally, 23 Crappé et al. 50 and Ladoukakis et al. 43 limited their search to single--exon sORFs, whereas 24 66% and 20% of sORFs predicted by us in the transcriptomes of mouse and fly, 25 respectively, span more than one exon. However, even when restricting the comparison 26 to single--exon sORFs, we find better conservation indicators for our results. 27 Given the consistently higher phyloCSF scores and often better dN/dS ratios of our 28 sORFs when comparing to other studies, we conclude that our results present a high--29 stringency set of sORFs coding for putatively functional micropeptides.

30
Novel peptides are often disordered and enriched for linear peptide motifs 31 We next investigated similarities and differences of sORF--encoded peptides to 32 annotated proteins. First, we used amino acid and codon usage to cluster predicted 33 sORFs, short and long annotated proteins and a negative control consisting of ORFs in 34 non--coding transcriptome regions with small phyloCSF scores. Looking at amino acid 35 usage, we were surprised to find that our novel predictions in four out of five species 1 clustered with the negative control. However, when choosing subsamples of the data, 2 novel predictions also often clustered together with annotated proteins, suggesting that 3 their overall amino acid usage is intermediate. Indeed, the frequencies of most amino 4 acids lie between those of positive and negative control. Interestingly, however, we 5 found that novel predictions clustered robustly with annotated proteins when analyzing 6 codon usage (with the exception of fruit fly). 7 Dissimilarity with annotated proteins was also confirmed when testing for homology to 8 the known proteome using BLAST. Only a small fraction of novel predictions, mainly 9 those in the 'CDS overlap' and 'other' categories, give significant hits (Fig. 4A). While 10 some novel sORFs are homologous to annotated small proteins as revealed by the 11 clustering analysis in Fig. 2C, there is no significant overlap between the sORFs that 12 were assigned to homology clusters and those that have similarity to known proteins 13 (Fisher's p > 0.1 for all species except for C. elegans where p=0.003). Hence, even 14 completely novel sORFs are sometimes conserved over wide distances. 15 We then hypothesized that differences in amino acid composition might give rise to 16 different structural properties. We used IUPred 63 to detect intrinsically unstructured 17 regions, and found that novel predictions are much more disordered than known small 18 proteins or a length--matched negative control (Fig. 4B). This could suggest that the 19 peptides encoded by conserved sORFs adopt more stable structures only upon binding 20 to other proteins, or else mediate protein--protein or protein--nucleic acid interactions 64 . 21 It has recently become clear that linear peptide motifs, which are often found in 22 disordered regions, can be important regulators of protein function and protein--protein 23 interactions 65 . Indeed, when searching the disordered parts of sORF--encoded peptides 24 for matches to motifs from the ELM database 66 , we find that the increased disorder 25 comes with a higher density of such motifs in the predicted peptides ( Fig. 4C), as was 26 also observed recently for peptides identified with mass spectrometry 23 . 27 Since a recent study identified toddler and a number of other predicted signal peptides 28 from non--annotated ORFs 30 , we searched our novel candidates with signalp 67 . Fig. 4D   29 shows that a small number of our predicted sORFs have predicted signal sequences, and 30 that most of these lack trans--membrane domains, but this does not exceed expectations 31 from searching a length--matched control set. However, the typically lower amino acid 32 conservation at the N--terminus of signal peptides could imply that some genuine 33 candidates escape our conservation filters. 34 Taken together, these results show that novel sORF--encoded peptides are different from 35 annotated proteins in terms of amino acid usage and sequence homology, that they are 36 enriched in disordered regions and peptide motifs, and that only few of them encode 1 signal peptides. 2

3'UTR sORFs are not consistently explained by stop--codon readthrough or
3 alternative terminal exons 4 sORFs in 3'UTRs (dORFs) are least likely to be predicted as conserved compared to the 5 other categories (Fig. S1C), but nevertheless we were surprised to find so many of them 6 (between 33 in zebrafish and 229 in human). Although the existence of conserved 7 dORFs was observed before 59 , and translation was also detected in ribosome profiling 10 , 8 to the best of our knowledge there are no known examples of functional peptides 9 produced from 3'UTRs (with the exception of known polycistronic transcripts). 10 Therefore we explored the possibility that these ORFs actually represent conserved 11 read--through events as suggested previously 35,68,69 , or come from non--annotated 12 alternative C--terminal exons. 13 We first checked 283 read--through events in Drosophila previously predicted by 14 conservation 68 , and 350 detected using ribosome profiling 35 Fig. S5D), and a step in conservation near the dORF start codons significantly more 3 pronounced than for control ORFs in 3'UTRs ( Fig. 5E and Fig. S5E). In addition, this 4 observation makes it unlikely that dORFs represent non--annotated alternative terminal 5 exons (where this methionine would not be associated with a conservation step). 6 Further, if such un--annotated exons existed in large numbers, we would expect that at 7 least some of our (pre--overlap filter) predictions overlap with already annotated 8 alternative exons. However, except for Drosophila we only find at most two dORFs with 9 CDS overlap, which is not more than expected compared to non--predicted dORFs ( Fig.   10 5F and Fig. S5F). 11 In sum, these data suggest that our identification of 3'UTR sORFs is not systematically 12 biased by conserved readthrough events or non--annotated terminal exons. Notably, we 13 also identified candidates that clearly represent independent proteins, such as the dORF  Table 7). We then compared mRNAs coding for short proteins and ). This analysis revealed that annotated short proteins come from transcripts with 27 higher expression and lower tissue or stage specificity than long proteins. Conversely, it 28 is well known that lincRNAs are not as highly and widely expressed as mRNAs 53,71 ; we 29 additionally find that lincRNAs with predicted sORFs are more highly and widely 30 expressed than other lincRNAs. This analysis indicates that peptide products of novel 31 sORFs could be of lower abundance than known small proteins, and that profiling 32 translation or protein expression from a limited number of cell lines or tissues might not 33 always yield sufficient evidence. We therefore used several datasets for the subsequent 34 analysis. 35 14 First, we mined publicly available ribosome profiling datasets in various human and 1 mouse tissues or cell lines, and in zebrafish (Supplementary Table 8). Several metrics to 2 identify translated regions from such data have been proposed 9,14--16 ; we rely here on the 3 ORFscore method used in our previous publication 10 , which exploits the frame--specific 4 bias of the 5' positions of ribosome protected fragments to distinguish actively 5 translated regions from those transiently associated with ribosomes or contaminants. It 6 requires relatively deep coverage and a very clear 3 nt periodicity in ribosomal 7 fragments, which is not always easily achievable (e.g., due to species--specific ribosome 8 conformational properties 11,35 ). We evaluated the ORFscore metric for datasets from 9 human (HEK293 cells 45 , KOPT--K1 cells 72 and human brain tissue 73 ), mouse (embryonic 10 stem cells 16 and brain tissue 73 ), and another zebrafish dataset 9 in addition to the one 11 used before 10 . The performance of these datasets was assessed by comparing ORFscore  Fig. 6A shows that predicted lincRNA sORFs have significantly higher ORFscores than 17 the negative control (p--values between 2e--7 and 0.002), and similarly 5'UTR sORFs 18 (p=2.5e--7 to 0.005) and sORFs in the "other" category (p=3.5e--7 to 0.04). sORFs in 19 3'UTRs reach marginal significance in some samples (p=0.02 for mouse brain and 20 zebrafish). Choosing an ORFscore cutoff of 6 as done previously 10 , we find 45 novel 21 sORFs translated in the human datasets, 15 in mouse, and 50 in zebrafish, respectively. 22 We also find evidence for the translation of several non--conserved length--matched 23 control sORFs, indicating that this set could contain lineage--specific or newly evolved 24 coding ORFs or ORFs with regulatory functions. 25 Next, we searched for peptide evidence in mass spectrometry datasets (Supplementary   26   Table 9). We analyzed 3 in--house datasets to be published elsewhere: one for a mix of 3 MaxQuant 83 against a custom database containing our candidates together with protein 33 sequences from UniProt. PSMs (peptide spectrum matches) were identified at 1% FDR, 34 and those mapping to another sequence in UniProt with one mismatch or ambiguous 35 amino acids were excluded. Using this strategy, we recover between 43 and 131 36 15 annotated small proteins per sample and confirm expression for 34 novel predictions in 1 human, 26 in mouse, 2 in zebrafish, 3 in fly and 9 in C. elegans (Fig. 6B). For instance, we 2 obtain PSMs for the recently described myoregulin micropeptide 31 and for the long 3 isoform of the fly tarsal--less gene 26--28 . In total, we find peptidomic evidence for 57 4 lincRNA sORFs. As observed previously in human 17,18 , mouse 23 and zebrafish 10 we also 5 find PSMs for sORFs in 3'UTRs and 5'UTRs. MaxQuant output for PSMs and their 6 supported sORFs is listed in Supplementary Tables 10--14, and the spectra with peak 7 annotation are shown in Supplementary Figures 7--11. 8 In human and mouse, the results for novel predictions have considerable overlap of 17 9 and 8 hits, respectively, indicating that peptides from some sORFs can be reliably 10 detected in multiple independent experiments. We also find more than one peptide for 9 11 and 11 novel sORFs in human and mouse and for one sORF in fly and worm, 12 respectively. Likely as a consequence of the differences in expression on the RNA level 13 (Fig. S6A), the PSMs supporting our novel predictions have generally lower intensities 14 than those supporting the positive control (Mann--Whitney p=4e--9; Fig. S6C). However, 15 we also observed that these PSMs are shorter than those mapping to UniProt proteins to sORFs in the positive control are also re--assigned, even though most of these sORFs 32 have independent evidence from other PSMs. This suggests that targeted mass 33 spectrometry approaches, complementary fragmentation techniques, or validation runs 34 using synthetic peptides 23 should be used to verify expression of ambiguous candidates. 35 16 In summary, we cross--checked our predictions against a variety of high--throughput 1 data: RNA--seq indicates that sORF--harboring lincRNAs are not as highly and widely 2 expressed as other mRNAs, but more than lincRNAs without conserved sORFs. 3 Analyzing ribosome profiling and mass--spectrometry data, we find evidence for 4 translation and protein expression from 110 and 74 novel sORFs, respectively, across all 5 datasets. 6

7
In our search for functional sORF--encoded peptides, we followed the idea that 8 evolutionary conservation is a strong indicator for functionality if the conservation 9 signal can be reliably separated from background noise and other confounding factors, 10 such as overlapping coding sequences or pseudogenes. We therefore used conservation 11 features that are very specific to known micropeptides (and canonical proteins), namely 12 a depletion of nonsynonymous mutations, an absence of frameshifting indels, and 13 characteristic steps in sequence conservation around start and stop codon. We then 14 chose confident sets of positive and negative control sORFs, both of which have many 15 members that are highly conserved on the nucleotide level, and combined these features 16 into a machine learning framework with very high sensitivity and specificity. 17 Importantly, our refined pipeline also achieves a more reliable rejection of sORFs on 18 pseudogene transcripts. Pseudogenes are important contaminants since frequent 19 intervening stop codons imply that many of the resulting ORFs are short. While many 20 pseudogenes are translated or under selective constraint, 48 sORFs in these genes 21 probably do not represent independent functional or evolutionary units. 22 Our integrated pipeline identifies sORFs comprehensively and with high accuracy, but 23 we want to highlight a number of caveats and avenues for future research. First, the 24 scope and quality of our predictions depends on the quality of the annotation: in some 25 species, pseudogenes, lincRNAs and short ncRNAs (especially snoRNAs and snRNAs) 26 have been characterized much more comprehensively, explaining some of the 27 differences in the numbers seen in Fig. 1D. For instance, a recent study suggests that 28 incomplete transcriptome assembly could lead to fragmented lincRNA identifications 29 that obscure the presence of longer ORFs. 86 Second, the quality of the predictions 30 depends on the choice of the training data: while we aimed to choose negative controls 31 that are transcribed into important RNA species and therefore often conserved on the 32 nucleotide level, the training set is inevitably already separable by length alone, since 33 there are only very few known small peptides below 50 aa, and very few ORFs on 34 ncRNAs longer than that. A larger number of functionally validated very short ORFs 1 would help to more confidently estimate prediction performance in this length range. 2 Third, we remark that in some cases segmental duplications and/or genomic repeats 3 give rise to a number of redundant sORFs, for instance in a 50kb region on zebrafish 4 chromosome 9, or on chromosome U in flies. Fourth, our analysis is currently limited to 5 finding canonical ORFs, even though usage of alternative initiation codons could be 6 widespread 15--17,44,45 . Alternative start codon usage might even produce specific 7 conservation signals that could be leveraged to confidently identify ORF boundaries. 8 Fifth, our approach is limited by the quality of the multiple species alignment: while the 9 micropeptides characterized so far have very clear signatures allowing an alignment--10 based identification, there could be many instances where sequence conservation within 11 the ORF and its flanking regions is not sufficient to provide robust anchors for a multiple 12 alignment. For instance, functionally homologous micropeptides can be quite diverged 13 on the sequence level. If additional homologous sequence regions can be reliably 14 identified and aligned, a codon--aware re--alignment of candidate sequences 87 could also 15 help to improve detection power. Further, we currently only tested for a depletion of 16 nonsynonymous mutations, but more sensitive tests could be implemented in a similar 17 way 49 . 18 Sixth, since we did not find sORFs from our positive control or other known 19 micropeptides to overlap with each other or longer ORFs, we used a quite conservative 20 overlap filter to choose from each genomic locus one ORF most likely to represent an 21 independent evolutionary and functional unit. This filter could be too restrictive: most 22 importantly for sORFs overlapping annotated long ORFs in alternative reading frames, 23 but also when the CDS annotation is incorrect, or for the hypothetical case that a 24 micropeptide has multiple functional splice isoforms. 25 Finally, we specifically examined 3'UTR sORFs, for which mechanisms of translation are 26 unclear. A very small number of cases could be explained by read--through or alternative 27 exons, but we did not observe global biases. Depending on the experimental conditions, 28 3'UTR ribosome occupancy can be observed in Drosophila and human cells, but it has 29 not been linked to active translation 36 . However, some mechanisms for downstream 30 initiation have been proposed 88,89 , ribosome profiling gives evidence for dORF 31 translation in zebrafish 10 , and some peptide products are found by mass--spectrometry 17--32 19,22,23 . Of course, the distinction between uORFs, main CDS, and dORFs becomes blurry 33 for polycistronic transcripts. 34 To assess putative functionality of the encoded peptides, we tested our candidates for 35 signatures of purifying selection; in addition to the expected depletion of 36 nonsynonymous mutations in the multiple alignment when comparing to conservation--1 matched controls, we also found a weaker (but in many cases highly significant) 2 depletion of nonsynonymous SNPs. A closer look at conservation statistics of identified 3 sORFs revealed that many novel predictions are widely conserved between species (e.g., 4 almost 350 in placental mammals and almost 40 in jawed vertebrates). By means of 5 homology clustering, we observed that some of these novel predictions are actually 6 homologous to known proteins, but we also found a sizable number of widely conserved 7 uORFs and dORFs. Based on sequence homology, we could identify 6 novel predictions 8 that are conserved between vertebrates and invertebrates. This small number is to be 9 expected, since only two of 105 known annotated small proteins similarly conserved are 10 shorter than 50 aa (OST4, a subunit of the oligosaccharyltransferase complex, and 11 ribosomal protein L41), and only a minority of our predicted sORFs is longer than that 12 (about 40% for zebrafish and 20% for the other species). Based on recently discovered 13 functional and structural similarities between different SERCA--interacting 14 micropeptides 31,32 , we expect that additional deep homologies between novel 15 micropeptides might emerge in the future. 16 We also performed a systematic comparison to 15 previously published catalogs of sORF 17 identifications, both computational and by means of high--throughput experiments.  34 Due to these and other ambiguities, a relatively limited overlap is not unexpected when 35 combining computational and experimental approaches 10 : for instance, ribosome 36 profiling provides a comprehensive snapshot of translated regions in the specific cell 1 type, tissue and/or developmental stage analyzed. This includes sORFs that are 2 translated for regulatory purposes or coding for fast--evolving or lineage--specific 3 peptides such as the small proteins with negative phyloCSF scores excluded from our 4 positive control set. A similar caveat applies to mass--spectrometry, which provides a 5 more direct test of protein expression but has lower sensitivity than sequencing--based 6 approaches, especially for low--molecular--weight peptides. The matching of measured 7 spectra to peptide sequences is also nontrivial. Especially in deep datasets, low--quality available data, we could confirm translation of more than 100 conserved sORFs in 23 several vertebrate ribosome profiling datasets using a stringent metric (ORFscore 10 ), 24 which exploits that actively translated regions lead to a pronounced 3 nt periodicity in 25 the 5'ends of ribosome protected fragments. We also analyzed a number of published 26 and in--house mass spectrometry datasets, and found peptidomic evidence for more than 27 70 novel candidates. 28 In conclusion, we present a comprehensive catalog of conserved sORFs in the 29 transcriptomes of five animal species. In addition to recovering known small proteins, 30 we discovered many sORFs in non--coding transcriptome regions. Most of these novel 31 sORFs are very short and some are widely conserved between species. Based on the 32 observation that encoded micropeptides are often disordered and rich in protein 33 interaction motifs, we expect that they could function through protein--protein or 34 protein--nucleic acid interactions. Given robust and confident signatures of purifying 35 20 selection, and experimental evidence for translation and protein expression, our 1 findings provide a confident starting point for functional analyses in vivo. 4 For all species, we used the transcript annotation from Ensembl (v74). Additionally, we 5 used published lincRNA catalogs for human 53,91 , mouse 92 , zebrafish 38,93 and fruit fly 94 , 6 and added modENCODE 54 transcripts for C. elegans. 7 We downloaded whole genome multiple species alignments from the UCSC genome coding sequence of a protein--coding transcript (i.e., biotype "protein coding", and a 20 coding sequence starting at the most upstream AUG, without selenocysteins, read--21 through or frameshift events). We classified ORFs as "pseudogene" if a member of a 22 group came from a transcript or a gene locus annotated as pseudogene. We designated 23 as "ncRNA" ORFs (negative controls) those with biotypes miRNA, rRNA, tRNA, snRNA or 24 snoRNA. Next, "3'UTR" ORFs were classified as such if they resided within the 3'UTRs of 25 canonical protein--coding transcripts, and if they did not overlap with annotated CDS 26 (see below). Analogously, we assigned "5'UTR" ORFs. In the category "CDS overlap" we 27 first collected ORFs that partially overlapped with 3'UTR or 5'UTR of canonical coding 28 transcripts. ORFs in the "other" category were the remaining ones with gene biotype 29 "protein coding", or non--coding RNAs with biotypes "sense overlapping", "nonsense--30 mediated decay", "retained intron" or other types except "lincRNA". Only those non--31 coding RNAs with gene and transcript biotype "lincRNA" were designated "lincRNA". To 32 exclude the possibility that alternative reading frames could be translated on transcripts 33 lacking the annotated CDS, we finally added those ORFs that were completely contained 1 in the annotated CDS of canonical transcripts to the "CDS overlap" category if other 2 group members did not fall into the "other" category. Transcripts not from Ensembl 3 were generally designated lincRNAs, except for C. elegans: in this case, we merged the 4 modENCODE CDS annotation with Ensembl, and classified only the "RIT" transcripts as 5 non--coding, while the ones that did not match the Ensembl CDS annotation were put in 6 the "other" category. We then added Swiss--Prot and TrEMBL identifiers from the 7 UniProt database (Nov 18 2014) to our ORFs by matching protein sequences. 8 Predicting conserved sORFs using a SVM 9 From the multiple alignments for each ORF, we extracted the species with at least 50% 10 sequence coverage and without frameshifting indels (using an insertion index prepared 11 before stitching alignment blocks), recording their number as one feature. score. The negative set consisted of sORFs from the "ncRNA" category with alignment 25 coverage, but without overlap with annotated CDS. 26 We estimated the performance of the classifier by 100 re--sampling runs, where we 27 chose training data from positive and negative set with 50% probability and predicted 28 on the rest. Prediction of pseudogene sORFs (inset of Fig. 1B) was done either with the 29 SVM, or based on the phyloCSF score alone, using a cutoff of 10 estimated from the 30 minimum average error point in the ROC curve.

31
Overlap filter 32 Refining our previous method, we designed an overlap filter as follows: in the first step, 33 we only kept annotated sORFs or those that did not intersect with conserved coding 34 exons. Here we took among the annotated coding exons in Ensembl (v74) or RefSeq (Sep 1 2 2014 for mouse, April 11 2014 for the other species) only those with conserved 2 reading frame, requiring that the number of species without frameshifting indels 3 reaches a threshold chosen from the minimum average error point in the ROC curves of 4 Fig. 1B and S1 (11 species for human, 10 for mouse, 4 for zebrafish, 7 for fruit fly, and 2 5 for worm). In a second step we also required that the remaining ORFs were not 6 contained in a longer ORF (choosing the longest one with the best phyloCSF score) that 7 itself was predicted by the SVM and did not overlap with conserved coding exons. were absent. We then inferred the common ancestors of these species and plotted the 1 fraction of ORFs with common ancestors at a certain distance to the reference species. 2 For the homology graph in Fig. 2D we blasted sORF amino acid sequences from the 3 different reference species against themselves and each other (blastp with options "--4 evalue 200000 --matrix PAM30 --word_size=2"). We then constructed a directed graph by 5 including hits between sORFs of similar size (at most 20% deviation) for E--value < 10 6 and an effective percent identity PIDeff greater than a dynamically adjusted cutoff that 7 required more sequence identity between shorter matches than longer ones (PIDeff = partial overlap means more than 50% but less than 99% on the nucleotide level.

22
Comparison to other studies 23 We obtained results from other studies in different formats (Supplementary Table 6). 24 Tryptic peptide sequences were mapped against the set of ORFs we analyzed (requiring 25 preceding lysine or arginine). Amino acid sequences were directly matched to our set of 26 ORFs, and ORF coordinates were matched to our coordinates (in some cases after 27 conversion between genome versions or the removal of duplicate entries). Since 28 different studies used different annotations and different length cutoffs, we then 29 excluded from the matched ORFs the ones not in the category under consideration, e.g., 30 longer ORFs, or sORFs that have since then been annotated or with host transcripts 31 classified as pseudogenes. The remaining ones were compared to our set of predictions. the hits with E--value > 10 --5 , percent identity > 50 and query coverage > 80% the best hit 3 (based on percent identity) to entries of the same or larger length that were not flagged 4 as "PREDICTED", "hypothetical", "unknown", "uncharacterized", or "putative". 5 For the disorder prediction in Fig. 4B, we used IUPred 63 in the "short" disorder mode 6 and averaged disorder values over the sequence. For the motif discovery in Fig. 4C 26 For all sORFs in the 3'UTR we obtained the annotated CDS of the respective transcript. 27 We then computed the step in the phastCons conservation score (average over 25 nt 28 inside minus average over 25 nt outside) at the stop codon of the annotated CDS and 29 compared protein--coding transcripts with dORFs that are predicted and pass the 30 overlap filter against other protein--coding transcripts. Similarly, we compared the step 31 around the start codons of dORFs. We also compared the distance between the 32 annotated stop and the start of the dORF, the distribution of the reading frame of the 33 dORF start with respect to the annotated CDS, and the number of intervening stops in 34 the frame of the annotated CDS. We finally checked how many predicted dORFs before 35 applying the overlap filter overlap with annotated coding sequence and compared 1 against the remaining dORFs. Analysis of published ribosome profiling data 21 We obtained published ribosome profiling data as summarized in Supplementary Table   22 8. Sequencing reads were stripped from the adapter sequences with the Fastqx toolkit. 23 The trimmed reads aligning to rRNA sequences were filtered out using bowtie. The 24 remaining reads were aligned to the genome using STAR, allowing a maximum of 5 25 mismatches and ignoring reads that mapped to more than 10 different genomic 26 locations. To reduce the effects of multi--mapping, alignments flagged as secondary 27 alignments were filtered out. We then analyzed read phasing by aggregating 5' read 28 ends over 100 nt windows around start and stop of annotated coding sequences from 29 Ensembl to assess dataset quality and obtain read lengths and 5' offsets for use in 30 scoring. From the datasets in Supplementary Table 8 we calculated the ORFscore as 31 described previously 10 , pooling the reads from all samples if possible.

32
Analysis of in--house and published mass spectrometry datasets 1 We used three in--house generated mass spectrometry datasets that will be published 2 elsewhere: one in a mixture of HEK293, HeLa and K562 cells, one in a mixture of HepG2, 3 MCF--10A, MDA--DB and MCF7 cells, and one in mouse C2C12 myoblasts and myotubes. 4 Further, we mined published datasets in HEK293 cells from Eravci et al. 75 26 We thank the N Rajewsky lab for fruitful discussion, and Fabian Bindel for sharing 27 unpublished mass--spec datasets. SDM and NR thank Francois Payre for initial  for novel predicted sORFs is smaller than for control ORFs in non--coding regions of the 5 transcriptome, but larger than for annotated sORFs. C) Percentage of sORFs conserved 6 in ancestral species as inferred from the multiple species alignment. Numbers for 7 informative ancestors are indicated (e.g., the ancestors of primates, placental mammals 8 and jawed vertebrates for H. sapiens). Symbols mark different reference species as in D). 9 D) homology clustering of predicted sORFs in different species; only clusters with at 10 least one non--annotated member and members from more than one species are shown, 11 with multiplicity indicated. *** p < 0.001; ** p < 0.01; * p < 0.05; Mann--Whitney tests in 12 A, reciprocal Χ 2 tests in B. analyzing sORFs in different organisms and genomic regions, the numbers of predicted 6 sORFs that are also predicted here (before overlap filter) or at least analyzed, 7 respectively, are given. phyloCSF scores and dN/dS ratios are compared for the sORFs 8 that are predicted either here or in another study but not in both. tw: this work. *** p < 9 0.001; ** p < 0.01; * p < 0.05, using Mann--Whitney (phyloCSF scores) and reciprocal Χ 2 10 tests (dN/dS), respectively. PSMs supporting annotated sORFs and peptides supporting novel predicted sORFs, 7 aggregated over all datasets after log--transformation and normalization (z--score) 8 relative to PSMs mapping to UniProt proteins. D) PSM length, E) Andromeda score, F) 9 peak intensity coverage and G) peak coverage for PSMs as in C. *** p < 0.001; ** p < 0.01; 10 * p < 0.05 (Mann--Whitney tests) 11 A B C D E F G