The Genomic Basis of Arthropod Diversity

Gregg W.C. Thomas, Elias Dohmen, Daniel S.T. Hughes, Shwetha C. Murali, Monica Poelchau, Karl Glastad, Clare A. Anstead, Nadia A. Ayoub, Phillip Batterham, Michelle Bellair, Gretta J. Binford, Hsu Chao, Yolanda H. Chen, Christopher Childers, Huyen Dinh, HarshaVardhan Doddapaneni, Jian J. Duan, Shannon Dugan, Lauren A. Esposito, Markus Friedrich, Jessica Garb, Robin B. Gasser, Michael A.D. Goodisman, Dawn E. Gundersen-Rindal, Yi Han, Alfred M. Handler, Masatsugu Hatakeyama, Lars Hering, Wayne B. Hunter, Panagiotis Ioannidis , Joy C. Jayaseelan, Divya Kalra, Abderrahman Khila, Pasi K. Korhonen, Carol Eunmi Lee, Sandra L. Lee, Yiyuan Li, Amelia R.I. Lindsey, Georg Mayer, Alistair P. McGregor, Duane D. McKenna, Bernhard Misof, Mala Munidasa, Monica Munoz-Torres, Donna M. Muzny, Oliver Niehuis, Nkechinyere Osuji-Lacy, Subba R. Palli, Kristen A. Panfilio, Matthias Pechmann, Trent Perry, Ralph S. Peters, Helen C. Poynton, Nikola-Michael Prpic, Jiaxin Qu, Dorith Rotenberg, Coby Schal, Sean D. Schoville, Erin D. Scully, Evette Skinner, Daniel B. Sloan, Richard Stouthamer, Michael R. Strand, Nikolaus U. Szucsich, Asela Wijeratne, Neil D. Young, Eduardo E. Zattara, Joshua B. Benoit, Evgeny M. Zdobnov, Michael E. Pfrender, Kevin J. Hackett, John H. Werren, Kim C. Worley, Richard A. Gibbs, Ariel D. Chipman, Robert M. Waterhouse, Erich Bornberg-Bauer, Matthew W. Hahn, Stephen Richards


Introduction
Arthropods (hexapods, crustaceans, myriapods and chelicerates) constitute the most species-rich and diverse phylum on Earth, having adapted, innovated, and expanded into all major habitats within all major ecosystems. They are found as detritivores, herbivores, carnivores, and parasites. As major components of the world's biomass, their diversity and ubiquity lead naturally to significant interactions with humanity, as disease vectors, crop pests and pollinators, food sources, and synanthropes. Despite their diversity, arthropods share a deeply conserved and highly modular body plan. They are bilaterally symmetrical, with serially repeated segments along the anterior-posterior axis. Many segments bear paired appendages, which can take the form of jointed legs, gills, feeding appendages or antennae. Many arthropods have evolved specialized secretions such as venom or silk, extruded from dedicated structures that further capitalize on this segmental modularity. Arthropods also have a hard exoskeleton, composed mostly of chitin, which molts as the animal grows in size. One group of arthropods, the winged insects (Pterygota), took to the skies, bearing up to two pairs of wings as outgrowths of that exoskeleton.
The extraordinary diversity of arthropods is manifested in a series of genomic changes and innovations selected for throughout their evolutionary history. However, linking this phenotypic diversity to underlying genomic changes remains an elusive challenge. The major transitions in arthropod evolution include the differential grouping of body segments into morphological units with a common function (e.g., head, thorax, and abdomen in the Hexapoda) in different taxa, the independent and parallel process of terrestrialization following freshwater colonizations from marine origins in several lineages 1,2 , the emergence of active flight in insects 3,4 , and the evolution of insect metamorphosis 5 . Multiple genomic mechanisms might be responsible for such innovations, but the underlying molecular transitions have not been explored on a broad . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint phylogenomic scale. Tracing these transitions at the genomic-level requires mapping whole genome data to a robust phylogenetic framework. Here we explore the evolution of arthropod genomes using a phylogeny-mapped genomic resource of 76 species representing the breath of arthropod diversity.

An Arthropod Evolution Resource
As a pilot project for the i5K initiative to sequence 5,000 arthropod genomes 6 , we sequenced and annotated the genomes of 28 arthropod species. These include a combination of species of agricultural or ecological importance, emerging laboratory models, and species occupying key positions in the arthropod phylogeny. We combined these newly sequenced genomes with those of 48 previously sequenced arthropods creating a dataset comprising 76 species representing the four extant arthropod subphyla and spanning 21 taxonomic orders. Using the OrthoDB gene orthology database 7 , we annotated 38,195 orthologous protein groups (orthogroups/gene families) among all 76 species (Fig. 1). Based on single-copy orthogroups within and between orders, we then built a phylogeny of all major arthropod lineages (Fig. 2). This phylogeny is mostly consistent with previous arthropod phylogenies 8,9 , with the exception being that we recover a monophyletic Crustacea with currently available species, rather than the generally accepted paraphyletic nature of Crustacea in respect to Hexapoda (see Methods). We reconstructed the gene content and protein domain arrangements for all 38,195 orthogroups in each of the lineages for the 76 species in the arthropod phylogeny. This resource (available at https://i5k.gitlab.io/ArthroFam/ and Table S10) forms the basis for the analyses detailed below and is an unprecedented tool for identifying and tracking genomic changes over arthropod evolutionary history.

Genomic Change Throughout Arthropod History
Evolutionary innovation can result from diverse genomic changes. New genes can arise either by duplication or, less frequently, by de novo gene evolution 10 . Genes can also be lost over time, constituting an underappreciated mechanism of evolution 11,12 . Protein domains are the basis of reusable modules for protein innovation and the rearrangement of domains to form new combinations plays an important role in molecular innovation 13 . Together, gene family expansions and contractions and protein domain rearrangements, may coincide with phenotypic innovations in arthropods. We therefore searched for signatures of such events corresponding with pivotal phenotypic shifts in the arthropod phylogeny.
Using ancestral reconstructions of gene counts (see Methods), we tracked gene family expansions and losses across the arthropod phylogeny. In all, we inferred 181,157 gene family expansions and 87,505 gene family contractions. 68,430 gene families were inferred to have gone extinct in at least one lineage, and 9,115 families emerged in different groups. The most dynamically-changing gene families encode proteins involved in functions of xenobiotic defense (cytochrome P450s, sulfotransferases), digestion (peptidases), chitin exoskeleton structure and metabolism, multiple zinc finger transcription factor types, HSP20 domain stress response, fatty acid metabolism, chemosensation, and ecdysteroid (molting hormone) metabolism (Table S14). Using the estimates of where in the phylogeny these events occurred, we can infer characteristics of . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint ancestral arthropods. For example, we identified 9,601 genes in the last insect common ancestor (LICA) and estimate ~14,700 LICA genes after correcting for unobserved gene extinctions. We reconstructed similar numbers for the ancestor of each of the six wellrepresented arthropod taxa in our sample (Table 1). Of the 9,601 genes present in LICA, we identified 147 emergent gene families (i.e., those with no close homologs in other clades) which appeared concurrently with the evolution of insects (Figs. 3A & 1 node 62,  Table S15). GO-term analysis of these 147 gene families recovered multiple key functions, including cuticle and cuticle development (suggesting changes in exoskeleton development), visual learning and behavior, pheromone and odorant binding (suggesting the ability to sense in terrestrial/aerial environments rather than in water), ion transport, neuronal activity, larval behavior, imaginal disc development, and wing morphogenesis. These emergent gene families likely allowed insects to undergo substantial diversification by expanding chemical sensing, such as an expansion in odorant binding to locate novel food sources and fine-tune species self-recognition [14][15][16] . Others, such as cuticle proteins underlying differences in exoskeleton structure, may enable optimized cuticle properties for diverse environmental niches or life history stages 17 . In contrast, the data reveal only ten gene families that arose along the ancestral lineage of the Holometabola (Fig. 3B, Table S16), implying that genes and processes required for the transition to holometabolous development, such as imaginal disc development, were already present in the hemimetabolous ancestors. This is consistent with Truman and Riddiford's model that the holometabolous insect larva corresponds to a late embryonic state in hemimetabolous insects 18 .
We identified numerous genes that emerged in specific orders of insects. Strikingly, we found 1,038 emergent gene families in the first ancestral Lepidoptera node (Fig. 3F). This node has by far the most emergent gene families, with the next highest being the node leading to the bumble bee genus Bombus with 860 emergent gene families. A recent analysis suggested whole genome duplication in the Lepidoptera lineage 19 , consistent with the high number of emergent gene families. Emergent lepidopteran gene families show enrichment for functional categories such as peptidases and odorant binding. Among the other insect orders, we find 227 emergent families in the node leading to the Hymenoptera, 205 in Coleoptera, and 156 in Diptera.
Similarly, we reconstructed the protein domain arrangements for all nodes of the arthropod phylogeny, that is, the permutations in protein domain type per (multi-domain) gene. In total, we can explain the underlying events for more than 40,000 domain arrangement changes within the arthropods. The majority of domain arrangements (48% of all observable events) were formed by a fusion of two ancestral arrangements, while the fission of an existing arrangement into two new arrangements accounts for 14% of all changes. Interestingly, 37% of observed changes can be explained by losses (either as part of an arrangement (14%) or the complete loss of a domain in a proteome (23%)), while emergence of a novel protein domain is a very rare event, comprising only 1% of total events.
We observe high concordance between rates of gene family dynamics and protein domain rearrangement (Figs. 4 & S27). In some cases, we find specific examples of overlap between gene family and protein domain evolution. For example, spiders have . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint the characteristic ability to spin silk and are venomous. Correspondingly, we identify ten gene families associated with venom or silk production that are rapidly expanding within the Araneae (spiders) clade (Table S17). In parallel, we find a high rate of newly emerged protein domains in the subphylum Chelicerata, including a large number within the Araneae that are associated with venom and silk production. For example, 'spider silk protein 1' (Pfam ID: PF16763), 'Major ampullate spidroin 1 and 2' (PF11260), 'Tubuliform egg casing silk strands structural domain' (PF12042) and 'Toxin with inhibitor cystine knot ICK or Knottin scaffold' (PF10530) are all domains that emerged within the spider clade. Venom domains also emerged in other venomous chelicerates, such as the bark scorpion, Centruroides sculpturatus.
We identified other gene family changes that may underlie unique phenotypic transitions. The evolution of eusociality among three groups in our study, bees and ants (both Hymenoptera), and termites (Blattodea), requires these insects to be able to recognize other individuals of their colony (such as nest mates of the same or different caste), or invading individuals (predators, slave-makers and hosts) for effective coordination. We find 41 functional terms enriched for gene family changes in all three groups, with multiple gene family gains related to olfactory reception and odorant binding (Table S18) in agreement with previous chemoreceptor studies of these species 20,21 .
Finally, we observe species-specific gene family expansions that suggest biological functions under selection. The German cockroach Blattella germanica, a pervasive tenant in human dwellings across the world, has experienced the highest number of rapidly evolving gene families among the arthropods studied here, in agreement with a previously reported major expansion of chemosensory genes 22 . We also find the largest number of domain rearrangement events at the terminal branch of B. germanica. The impressive capability of this cockroach to quickly adapt to changing environments could be linked to these numerous and rapid evolutionary changes at the genomic level and warrants more detailed investigation.

Evolutionary Rates Within Arthropod History
The rate of genomic change can reflect key events during evolution along a phylogenic lineage. Faster rates might imply small population sizes or strong selective pressure, possibly indicative of rapid adaptive radiations, and slower rates may indicate stasis. Studying rates of change requires a time-calibrated phylogeny. For this, we used 22 fossil calibration points 8,23 and obtained branch lengths for our phylogeny in millions of years (My) (Fig. 2) that are very similar to those obtained by Misof et al. 8 and Rota-Stabelli et al. 9 .
We examined the rates of three types of mutational events: (i) amino acid substitutions, (ii) gene duplications and gene losses, and (iii) protein domain rearrangements, emergence, and loss. While clearly not changing in a clock-like manner, all types of events have a strikingly small amount of variation in rate among the investigated species (Fig. 4). We estimate an average amino acid substitution rate of 2.54 x 10 -3 substitutions per site per My with a standard deviation of 1.11 x 10 -3 . The slowest rate is found in the branch leading to the insect order Blattodea (cockroaches and . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint termites), while the fastest rates are found along the short branches during the early diversification of Holometabola, suggesting a period of rapid evolution, a pattern similar to that found for amino acid sequence evolution during the Cambrian explosion 24 . Other branches with elevated amino acid divergence rates include those leading to the Acarina (mites), and the overall path to the Diptera (flies).
Rates of gene duplication and loss show remarkably little variation, both across the tree and within the six multi-species orders (Table S12). Overall, we estimate an average rate of 43.0 gains/losses per My, but with a high standard deviation of 59.0 that is driven by a few lineages with greatly accelerated rates. Specifically, the terminal branches leading to the leafcutter ants Atta cephalotes and Acromyrmex echinatior have exceptionally high gene gain/loss rates of 266 and 277 per My, an order of magnitude higher than average, as previously reported 25 . Removing these and two other outliers, the average becomes 27.2 gains/losses per My (SD 19.7). Interestingly, the high gain/loss rates observed in these ants, in contrast to other arthropods, are not due to large gene content change in a small number of gene families. They are instead due mostly to single gene gains or losses in a large number of gene families.
Regarding protein domain rearrangements, which mainly arise from duplication, fusion and terminal losses of domains 26 , we estimate an average rate of 5.27 events per My, approximately eight-fold lower than the rate of gene gain/loss. Interestingly, we discovered a strong correlation between rates of gene gain/loss and domain rearrangement (Figs. 4 & S27) even though these processes follow largely from different underlying genetic events. For example, terminal branches within the Hymenoptera have an accelerated rate of domain rearrangement, which coincides with the increased rate of gene gains and losses observed along those branches.
Although amino acid substitutions are mediated by distinctly different processes than gene gain, loss, and rearrangement, it is possible that similar life-history traits or similar selective regimes could have led to parallel acceleration or deceleration of these different types of genomic change. However, our examination of rates of variation in all types of change among lineages found no correlation between variation in amino acid substitution rates and rates of gene gain/loss (Figs. 4 & S27). Branches with accelerated rates of amino acid substitution, such as the lineage leading to the most recent common ancestor of the insect superorder Holometabola, do not show corresponding increases in gene gain/loss rates. Similarly, the hymenopteran lineages displaying the fastest rate of gene gain/loss in our analysis do not display higher rates of amino acid substitutions.
We observe accelerated rates of amino acid substitutions in several deep branches dating to 300-600 Mya. Descendants from those branches display accelerated rates of gene gain/loss and domain rearrangement. This suggests that, in some lineages, accelerated amino acid substitution rates may lay the foundation for subsequent genomic changes. Interestingly, we did not record a corresponding increase in gene gain/loss or domain rearrangement with the increased amino acid substitution rates, instead, we observed a generally steady rate. In Hymenoptera specifically, we again observe accelerated rates of amino acid substitutions in lineages preceding increased rates of gene gain/loss within the last 100 My.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint

Methylation Signals in Arthropod Genomes
In addition to records of gene content and protein domain change in the genome, we also searched for signals of gene regulation. DNA methylation is an important epigenetic modification that affects gene regulation in widely divergent organisms 27 . We find several independent losses of the DNA methylation machinery across the arthropods (Fig.  S7) 28 . This indicates that DNA methylation is not universally necessary for development and that the DNA methyltransfereases in insects may function in ways not previously appreciated 29 . Additionally, putative levels of DNA methylation vary considerably across arthropod species (Figs. S7, S8). Notably, the hemimetabolous insects and non-insect arthropods show higher levels of DNA methylation than the holometabolous insects 28 . Araneae (spiders), in particular, show clear bimodal patterns of methylation (Fig. S8); where some genes are highly methylated and others are not. This pattern is also found in some holometabolous insects, suggesting that the division of genes into methylated and unmethylated categories is a relatively ancient trait in Arthropoda, although many species have since lost this clear distinction. Finally, some taxa, particularly in Hymenoptera, show higher levels of CpG di-nucleotides than expected by chance alone, which may be a signal of strong effects of gene conversion in the genome 30 .

Concluding remarks and future directions
The i5K pilot initiative has assembled an unparalleled genomic dataset for research on arthropods and conducted a detailed phylogenetic analysis of evolutionary changes at the genomic level within this diverse and fascinating phylum. The combined research output of this species-level i5K work has been substantial and wide-ranging, addressing pests of agricultural crops 31,32 and animals 33 , urban 20,34 and forest 35 pests, biocontrol species 36 , along with developmental models 17,37,38 , indicators of water quality and models for toxicology 14,39 .
Here, in contrast, we take a broad overview generating a comparative genomics resource for a phylum with an evolutionary history of over 500 million years. Our analysis identifies multiple broad patterns such as the very small number of novel protein domains, and a surprising lack of variation in the rates of some types of genomic change. We pinpoint the origin of specific gene families and trace key transitions during which specific gene families or protein domains have undergone rapid radiations or contractions. Nonetheless, drawing functional biological conclusions from these data is not straightforward. In some cases, the link between specific gene families and their biological function is clear. This is true for genes related to specific physiological function (e.g., olfaction), or to the production of specific compounds (e.g., silk or venom). However, for many gene families, there is no known function, highlighting the need for functional genomic studies. This is true both for emergent gene families, such as those identified in the Lepidoptera which cannot be studied in the dipteran Drosophila model, and for rapidly evolving and diverging gene families.
A key result is that major morphological transitions cannot be assigned to particular genomic changes. Indeed, there is no genomic signature for morphological . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint evolution, and no novel genes or significant gene expansions or domain rearrangements can be linked to novel morphological structures or to major events in body plan evolution, such as the evolution of wings or holometabolous metamorphosis. This supports the emerging understanding that morphology is effected by complex gene networks, which are active mostly during ontogenetic processes 40 , rather than by individual "morphology genes". Morphological innovations are often based on modulating the timing and location of expression, rewiring of existing gene networks, and assembling new networks using existing developmental toolkit genes 41 .
The advent of affordable and widely transferable genomics opens up many avenues for evolutionary analyses. The genome is both the substrate and record of evolutionary change, and it encodes these changes, but the connection is far from simple. A better understanding of the genotype-phenotype map requires in-depth experimental studies to test hypotheses generated by genomic analyses, such as those presented here. The diversity of arthropods provides unparalleled taxonomic resolution for phenotypic change, which, combined with the experimental tractability of many arthropods, suggests a productive area of future research using and building upon the resource established herein.

Sequencing, assembly and annotation
Twenty-eight arthropod species were sequenced using Illumina short read technology. In total, 126 short read libraries were generated and sequenced to generate 4,883 Gbp of raw sequence (Table S2). For individual species, reads were assembled using AllpathsLG 42,43 followed by refinements employing Atlas-Link 44 and Gapfill 45 . Version 1.0 assemblies had minimum, mean, and maximum scaffold N50 lengths of 13.8 kbp, 1.0 Mbp and 7.1 Mbp (Table S3). Following re-assembly and collapsing of unassembled haplotypes using Redundans 46 , version 2.0. assemblies had minimum, mean and maximum contig N50 lengths of 11.1, 166.2 and 857.0 kbp with a mean scaffold N50 lengths of 619 kbp (Table S3).
To support the annotation, RNAseq data were generated from 25 species for which no data were available (Table S4). A MAKER 47 based automated annotation pipeline was applied to the 1.0 assembly of each species with species-specific input RNAseq data and alignment data from a non-redundant metazoan protein sequence set containing all available arthropod protein sequences (see Supplementary methods). This pipeline was applied to 28 species with annotatable genome assemblies generating 533,636 gene models, with minimum, mean, and maximum gene model numbers of 10,901, 19,058, and 33,019 per species (Table S5, see Table S6 for completeness statistics). Many of these gene models were manually curated using the i5k Workspace@NAL 48 . Given the magnitude of this manual task, the greatest fraction of gene models manually confirmed for a species was 15 %. The analyses presented here were performed on the automatically generated gene models.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint

Orthology prediction
Orthology delineation is a cornerstone of comparative genomics, offering qualified hypotheses on gene function by identifying "equivalent" genes in different species. We used the OrthoDB 7 (www.orthodb.org) orthology delineation process that is based on the clustering of best reciprocal hits (BRHs) of genes between all pairs of species. Clustering proceeds first by triangulating all BRHs and then subsequently adding in-paralogous groups and singletons to build clusters of orthologous genes. Each of these orthologous groups represent all descendants of a single gene present in the genome of the last common ancestor of all the species considered for clustering 49 .
The orthology datasets computed for the analyses of the 28 i5K pilot species, together with existing sequenced and annotated arthropod genomes were compiled from OrthoDB v8 50 , which comprises 87 arthropods and an additional 86 other metazoans (including 61 vertebrates). Orthology clustering at OrthoDB included ten of the i5K pilot species (Anoplophora glabripennis, Athalia rosae, Ceratitis capitata, Cimex lectularius, Ephemera danica, Frankliniella occidentalis, Ladona fulva, Leptinotarsa decemlineata, Orussus abietinus, Trichogramma pretiosum). The remaining 18 i5K pilot species were subsequently mapped to OrthoDB v8 ortholog groups at several major nodes of the metazoan phylogeny. Orthology mapping proceeds by the same steps as for BRH clustering, but where existing ortholog groups are only permitted to accept new members, i.e., the genes from species being mapped are allowed to join existing groups if the BRH criteria are met. The resulting ortholog groups of clustered and mapped genes were filtered to select all groups with orthologs from at least two species from the full set of 76 arthropods, as well as retaining all orthologs from any of 13 selected outgroup species for a total of 47,281 metazoan groups with orthologs from 89 species. Mapping was also performed for the relevant species at the following nodes of the phylogeny: Arthropoda (38,195 groups, 76 species); Insecta (37,079 groups, 63 species); Endopterygota (34,614 groups, 48 species); Arachnida (8,806 groups, 8 species); Hemiptera (8,692 groups, 7 species); Hymenoptera (21,148 groups, 24 species); Coleoptera (12,365 groups, 6 species); and Diptera (17,701, 14 species). All identified BRHs, protein sequence alignment results, and orthologous group classifications were made available for downstream analyses: http://ezmeta.unige.ch/i5k.

Arthropod phylogeny
We reconstructed the arthropod phylogeny (Fig. 2) using protein sequences from the 76 genomes. Six different phylogenetic reconstruction approaches generated a consistent relationship among the orders (see Supplemental Methods), corresponding with previously inferred arthropod phylogenies 8,9 . However, our analysis placed the Psocodea, represented here by the human body louse, Pediculus humanus, with low bootstrap support as a sister to Thysanoptera and Hemiptera, contradicting a recent report placing this order sister to Holometabola 8 .
Of the six orders in our dataset represented by multiple species (Figs. S11-16), relationships within the Araneae, Hemiptera, Coleoptera, and Lepidoptera were identical, regardless of the tree building method used. Within the Hymenoptera, the only disagreement between methods concerned the position of the parasitoid wasps within the . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint Chalcidoidea, with three methods placing Copidosoma floridanum as sister to Nasonia vitripennis (in agreement with recent phylogenomic research 51 ), and three sister to Trichogramma pretiosum (Fig. S13). Within the Diptera, we obtained a sister group relationship between the sand fly, Lutzomyia longipalpis, and the Culicidae, but this was not a stable topology across methods (Fig. S16).
The most contentious nodes in the phylogeny involve the relationship of crustaceans and hexapods. Here, we recover a monophyletic Crustacea that represent the sister clade to Hexapoda (Fig. 2), in contrast to recent analyses suggesting this group is paraphyletic, with Hexapoda as an in-group, and sister to the remipedes and branchiopods (Cladocera) 52 . However, an extensive phylogenetic investigation (Supplementary Results, Fig. S9) shows that regardless of the inference method used, the relationships among the crustacean and hexapod lineages remain uncertain. Aside from these few discrepancies, nodal support values across the tree were high for all tree building methods used. Even when bootstrap support was < 100 %, all methods still inferred the same topology among the species included. Most importantly, remipedes are missing from our taxon sampling, as are ostracods, pentatomids and mystacocarids.

Divergence time estimation
Phylogenetic branch lengths calibrated in terms of absolute time are required to study rates of evolution and to reconstruct ancestral gene counts. We used a nonparametric method of tree smoothing implemented in the software r8s 53 to estimate these divergence times. Fossil calibrations are required to scale the smoothed tree by absolute time. We relied on Wolfe et al.'s 23 aggregation of deep arthropod fossils with additional recent fossils used by Misof et al. 8 (Table S13). The results indicate that the first split within arthropods (the chelicerate-mandibulate split) occurred ~570 million years ago (mya). We estimate that within the chelicerates, arachnids radiated from a common ancestor ~500 mya. Within the mandibulates, myriapods split from other mandibulates ~570 mya. Crustaceans started radiating ~506 mya, and insects started radiating ~430 mya.

Substitution rate estimation
To estimate substitution rates per year on each lineage of the arthropod phylogeny, we divided the expected number of substitutions (the branch lengths in the unsmoothed tree) by the estimated divergence times (the branch lengths in the smoothed tree) (Fig. 4).

Gene family analysis
With the 38,195 orthogroups and the ultrametric phylogeny we were able to perform the largest gene family analysis of any group of taxa to date. In this analysis we were able to estimate gene turnover rates (λ) for the six multi-species taxonomic orders, to infer ancestral gene counts for each taxonomic family on each node of the tree, and to estimate gene gain/loss rates for each lineage of the arthropod phylogeny. The size of the dataset and the depth of the tree required several methods to be utilized.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint Gene turnover rates (λ) for the six multi-species orders were estimated with CAFE 3.0, a likelihood method for gene-family analysis 54 . This version of CAFE is able to estimate the amount of assembly and annotation error (ε) present in the input data using a distribution across the observed gene family counts and a pseudo-likelihood search. CAFE is then able to correct for this error and obtain a more accurate estimate of λ (Table  S12). However, with such deep divergence times of some orders, estimates of ε may not be accurate. CAFE has a built-in method to assess significance of changes along a lineage, given an estimated λ, and this method was used to identify rapidly evolving families within each order. We partitioned the full dataset of 38,195 orthogroups for each order such that taxa not in the order were excluded for each family and only families that had genes in a given order were included in the analysis. This led to the counts of gene families presented in Table S11.
For nodes with deeper divergence times across Arthropoda, likelihood methods to reconstruct ancestral gene counts such as CAFE, become inaccurate. Instead, a parsimony method was used to infer these gene counts across all 38,195 orthogroups 55 . Parsimony methods for gene family analysis do not include ways to assess significant changes in gene family size along a lineage. Hence, we performed a simple statistical test procedure for each branch to assess whether a given gene family was changing significantly: under a stochastic birth-death process of gene family evolution, and within a given family, the expected relationship between any node and its direct ancestor is that no change will have occurred. Therefore, we took all differences between nodes and their direct descendants in a family and compared them to a one-to-one linear regression. If any of the points differ from this one-to-one line by more than two standard deviations of the variance within the family, it was considered a significant change and that family is rapidly evolving along that lineage. Rates of gene gain and loss were estimated in a similar fashion to substitution rates. We counted the number of gene families inferred to be changing along each lineage and divided that by the estimated divergence time of that lineage (Fig. 4).
To estimate ancestral gene content (i.e., the number of genes at any given node in the tree), we had to correct for gene losses that are impossible to infer given the present data. To do this, we first regressed the number of genes at each internal node with the split time of that node and noticed the expected negative correlation of gene count and time (Fig. S5) (r 2 =0.37; P=4.1 x 10 -9 ). We then took the predicted value at time 0 (present day) as the number of expected genes if no unobserved gene loss occurs along any lineage, and shifted the gene count of each node so that the residuals from the regression matched the residuals of the 0 value.

Protein domain evolution analysis
We annotated the proteomes of all 76 arthropod species and 13 outgroup species with protein domains from the Pfam database (v30) 56 . Thereby, every protein was represented as a domain arrangement, defined by its order of domains in the amino acid sequence. To prevent evaluating different isoforms of proteins as additional rearrangement events, we removed all but the longest isoform. Repeats of a same domain were collapsed to one instance of the domain (A-B-B-B-C → A-B-C), since copy numbers of some repeated . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint domains can vary strongly even between closely related species 57,58 . To be able to infer all rearrangement events over evolutionary time, we reconstructed the ancestral domain content of all inner nodes in the phylogenetic tree via the DomRates tool (http://domainworld.uni-muenster.de/programs/domrates/) based on a combined parsimony approach (see Supplementary Methods). Six different event types were considered in this study (Fig. S6): fusion, fission, terminal loss/emergence and single domain loss/emergence. For the rate calculation just all arrangement changes were considered that could be explained by exactly one of these event types, while all arrangements were ignored that could not be explained by one of these events in a single step or if multiple events could explain a new arrangement.
. CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018.   The bars show Metazoa-level orthologs for the 76 selected arthropods and three outgroup species (of 13 outgroup species used for analysis) partitioned according to their presence and copy-number, sorted from the largest total gene counts to the smallest. The 28 i5K species generated in this study with a total of 533,636 gene models are indicated in bold green font with species images positioned above each bar. A total of 38,195 orthologous protein groups were annotated among the total 76 genomes.  Table S10 for full node labels. We identify 147 gene families emerging during the evolution of insects, including several which may play an important role in insect development and adaptation. B. Contrastingly, we find only 10 emergent gene families during the evolution of holometabolous insects, indicating many gene families were already present during this transition. C. Among all lineage nodes, we find that the node leading to Lepidoptera has the most emergent gene families. D. leafcutter ants have experienced high rates of both gene gain/loss and protein domain rearrangement. E. Blattella germanica has experienced the highest number of rapid gene family changes, possibly indicating its ability to rapidly adapt to new environments. F. We observe signals of methylation patterns in all Aranae (spiders) genomes investigated (species shown: The Brown Recluse spider, Loxosceles reclusa) and the genome of the bark scorpion Centruroides exilicauda. The two peaks show different CG counts in different gene features, with depletion of CG sequences in the left peak due to methylated C's mutating to T. This suggests epigenetic control of a significant number of spider genes. Additional plots for all species in this study are shown in Fig. S8. . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint    . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 4, 2018. ; https://doi.org/10.1101/382945 doi: bioRxiv preprint