Universal Features in the Genome-level Evolution of Protein Domains

Protein domains are found on genomes with notable statistical distributions, which bear a high degree of similarity. Previous work has shown how these distributions can be accounted for by simple models, where the main ingredients are probabilities of duplication, innovation, and loss of domains. However, no one so far has addressed the issue that these distributions follow definite trends depending on protein-coding genome size only. We present a stochastic duplication/innovation model, falling in the class of so-called Chinese Restaurant Processes, able to explain this feature of the data. Using only two universal parameters, related to a minimal number of domains and to the relative weight of innovation to duplication, the model reproduces two important aspects: (a) the populations of domain classes (the sets, related to homology classes, containing realizations of the same domain in different proteins) follow common power-laws whose cutoff is dictated by genome size, and (b) the number of domain families is universal and markedly sublinear in genome size. An important ingredient of the model is that the innovation probability decreases with genome size. We propose the possibility to interpret this as a global constraint given by the cost of expanding an increasingly complex interactome.


A2. RESULTS FOR FOLD DOMAIN CLASSES
All data shown in the main text refer to the superfamily taxonomy level, and come from the SUPER-FAMILY database. In this section, we report the results of the same analysis in terms of SCOP folds, which show that this category has essentially the same behavior as the previous one (figure A2.4). While by definition there are more superfamilies than folds, the number of domain classes versus genome size has very similar scaling in the two cases. The two plots collapse almost exactly, when folds are rescaled by the

GENE.
The most recent version (1.73) of the SUPERFAMILY genome assignments, provides domain associations by scoring the longest transcript per gene. Figure A3.6 reports the behavior of the number of domain classes in these data. While proteome sizes are overestimated roughly by a factor of two due to alternative splicing, the collective behavior and the sublinear trend of F (n) do not change.

A4. CRP MODEL WITH SPECIFIC DOMAIN CLASSES AND ANALITYCAL MEAN FIELD EQUATIONS
In this section we discuss the variant of the CRP model introduced in the main text and its analytical treatment. We first give some more details on the definition of the model. Generically, we consider the following genetic algorithm. For each genome size n, the configuration is a set of M genomes The cost function, for a generic model genome g, can be a function F(g), that takes into account some phenomenological features observed in the data. We choose to include in F a minimal amount of empirical information on the occurrence of each domain class contained in figure 1C. In other words, we distinguish between "universal" domain classes, used in most of the genomes, and "contextual" ones, occurring only in a few examples. As discussed in the main text, this is sufficient to obtain quantitative agreement with the observed domain distributions (figures 1B and 2B), which are not given to the model as an input. If domain classes are indexed by i = 1..D (D = 1443 for Superfamilies), we define the variable σ g i as follows The cost function of that genome is then defined as is the empirical average of the same observable: In the above formula G is the number of observed genomes in the data set. For example, in the case of prokaryotes in the SUPERFAMILY database, G = 327 and, calling Ξ i the function plotted in figure 1C, we For the analitycal treatment, we considered the case M = 1, q = 2, where at each iteration, one genome is selected from a population of two. Starting from configuration g(n), in the proliferation step genomes g , g are generated with CRP rules, and the selection step chooses g(n + 1) = argmax(F(g ), F(g )). In this case, since the selection rule chooses strictly the maximum, it is able to distinguish the sign of σ EMP i only. For this reason, it is sufficient to account for the positivity (which we label by "+") and negativity ("-") of this function for a given domain index i. Note that this reduces the effective parameters to one only: the fraction of universal domain classes. The genomes g and g proposed by the CRP proliferation step can have the same (labeled by "1"), lower ("1 + ") or higher ("1 − ") cost than their parent, depending on p O , p N and by the probabilities to draw a universal or contextual domain family, p + and p − respectively. Using these labels, the scheme of the possible states and their outcome in the selection step is given by the table below.
proliferation (g , g ) probability selection (1, 1) From this table, it is straightforward to derive the modified probabilitiesp O andp N of the complete iteration:p are the probabilities that the new domain is drawn from the universal or contextual families respectively.
We now write the macroscopic evolution equation for the number of domain families using the same procedure as in the main text. Calling k + (n) and k − (n) the number of domain classes that have positive or negative σ EMP i and are not represented in g(n), (1) The above equations have the following consistency properties • ∂ n F ≤ 1, hence F (n) ≤ n. • ∂ n F ≥ 0, ∂ n k + ≥ 0 and ∂ n (F + k + ) ≥ 0 so that F grows faster than k + decreases.
Choosing the initial conditions from empirical data n 0 , F (n 0 ) size and number of domain classes of the smallest genome, we have, since F (n 0 ) < n 0 and α ≤ 1, It is simple to verify that under this condition the system always has solutions that relax to a finite value  Figure A4.8B, shows that, for large values of α (above 0.7) this function reaches a maximum at sizes between 2000 and 4000. This is also where most of the genomes in the data set are found, indicating that this range of genome sizes may allow the optimal usage of universal and contextual domain families.

A5. OTHER VARIANTS OF THE CRP
We discuss here mean-field arguments for the robustness of our results on the asymptotics of F (n) for two variants of the original model, including a small domain loss rate and global duplications.
a. Global Duplications. One can consider the presence of global duplication moves. At each time step, if duplication is chosen, a number of domains selected with q > 1 trials from a binomial distribution with parameter p i O is duplicated in the same time step. The innovation step remains the same. In this case, it is not possible to measure time with the size n of the genome, but this observable follows the evolution where˙indicates the derivative with respect to time t. In terms of t, our mean field equations are worked out simply asḞ (t) = p N andK i (t) = qp i O . Using Eq.
(2), they can be simply converted in terms of n, yielding ∂ n F (n) = αF (n) + θ qn + (q − 1)αF (n) + θ , The first equation gives as leading scaling F (n) ∼ n (α/q) , showing that the growth of F is pushed towards effectively lower values of α by global duplications, as a consequence of the rescaling of time by the global moves. The dynamics for K i , instead, is affected only by a renormalization of the parameter θ. The qualitative results of the model are therefore stable to the introduction of a global duplication rate, in the hypothesis that the extent of these duplications does not scale with n.
b. Domain Loss. A second interesting variant of the model considers the introduction of a homogeneous domain deletion, or loss rate. Domain loss is known to occur in genomes. However, it is not considered in our basic model for simplicity and economy of parameters. In order to introduce it in the CRP, we define a loss probability p L = δ. This is equally distributed among domains, so that the per class loss probability is p i L = δ K i n . Consequently, the duplication and innovation probability p O and p N are rescaled by a factor (1 − δ). The mean-field evolution equation for the number of domain classes becomeṡ where the sink term for F derives from domain loss in classes with a single element, quantified by F (1, n).
In order to solve this equation, one needs an expression for F (1, n). Here, we report an argument based on the fact that in direct simulation of the model, for large n, F (1, n) = γF (n), with 0 < γ < 1 (data not shown). This trend is also confirmed by the empirical data. Using this experimentally motivated ansatz, we can show that for small δ, the scaling of F (n) is subject only to a small correction. In the model including domain deletions, more genomes of the same history can have the same size. Again, since model time t (which should be regarded as a fictitious variable, with a complex relation with evolutionary time in generations) does not correspond genome size, one has to consider the evolution of n with t, given in this model simply byṅ = 1 − 2δ. Using this equation it is possible to obtain the evolution equation for F (n).
Considering an expansion in small δ and large n, this reads to first order ∂ n F (n) F (n) = α n 1 + δ α − γ α .
The above equation gives the conventional scaling for F (n), with the aforementioned correction. Note that the correction could be positive or negative, depending on the relative values of α and γ. An analogous argument holds for α = 0.