Molecular evolutionary trends and feeding ecology diversification in the Hemiptera, anchored by the milkweed bug genome

Background The Hemiptera (aphids, cicadas, and true bugs) are a key insect order, with high diversity for feeding ecology and excellent experimental tractability for molecular genetics. Building upon recent sequencing of hemipteran pests such as phloem-feeding aphids and blood-feeding bed bugs, we present the genome sequence and comparative analyses centered on the milkweed bug Oncopeltus fasciatus, a seed feeder of the family Lygaeidae. Results The 926-Mb Oncopeltus genome is well represented by the current assembly and official gene set. We use our genomic and RNA-seq data not only to characterize the protein-coding gene repertoire and perform isoform-specific RNAi, but also to elucidate patterns of molecular evolution and physiology. We find ongoing, lineage-specific expansion and diversification of repressive C2H2 zinc finger proteins. The discovery of intron gain and turnover specific to the Hemiptera also prompted the evaluation of lineage and genome size as predictors of gene structure evolution. Furthermore, we identify enzymatic gains and losses that correlate with feeding biology, particularly for reductions associated with derived, fluid nutrition feeding. Conclusions With the milkweed bug, we now have a critical mass of sequenced species for a hemimetabolous insect order and close outgroup to the Holometabola, substantially improving the diversity of insect genomics. We thereby define commonalities among the Hemiptera and delve into how hemipteran genomes reflect distinct feeding ecologies. Given Oncopeltus’s strength as an experimental model, these new sequence resources bolster the foundation for molecular research and highlight technical considerations for the analysis of medium-sized invertebrate genomes. Electronic supplementary material The online version of this article (10.1186/s13059-019-1660-0) contains supplementary material, which is available to authorized users.


Library preparation
To prepare the 180-bp and 500-bp libraries, we used a gel-cut paired end library protocol. Briefly, 1 µg of the DNA was sheared using a Covaris S Quantification and size distribution of the final library was determined before sequencing as described above.

Sequencing
Sequencing was performed on Illumina HiSeq2000s generating 100-bp paired end reads. Reads were pre-processed using cutadapt for adapter removal and sickle-trim for quality trimming (-min length 20bp). Subsequently, reads were assembled using ALLPATHS-LG (v35218) [1] on a large memory computer with 1 TB of RAM and further scaffolded and gap-filled using in-house tools Atlas-Link (v.1.0) and Atlas gap-fill (v.2.2) (https://www.hgsc.bcm.edu/software/). This yielded an assembly of 1.099 Gb (774 Mb without gaps within scaffolds), with a contig N50 of 4.0 kb and scaffold N50 of 340 kb, which has been deposited in GenBank (assembly accession GCA_000696205.1).

2.1.a Flow cytometry estimation
Contributors  maculatus standard. Both standards gave the genome size estimates shown, but the standard error (shown) was lower using estimates based on the C. maculatus standard.
The genome size of the lab and wild strains are not significantly different.
However, the male genome is very slightly larger than the female (Table S 2.1). The larger genome size estimate for the male is consistent with a large neo X/Y in the male. Oncopeltus has 2n = 16 chromosomes and an XX/XY type of sex determination; the Y is largely heterochromatic and transcriptionally inactive, pairing only briefly with the X in meiosis [3]. The first peak is the D. virilis 2C. The small second peak is the D. virilis 4C, the large third peak is the 2C of Oncopeltus, the 4th peak is the C. maculatus 2C peak, and the two smaller right-most peaks are the 4C of Oncopeltus and the 4C of C. maculatus, respectively.

Contributors: Iris M. Vargas Jentzsch and Kristen A. Panfilio
With the aim of estimating genome size, heterozygosity, and repeat content from unassembled sequencing reads, we tried several approaches to characterize the k-mer frequency spectrum in the Oncopeltus genomic dataset. As starting point we had four 100-bp Illumina read libraries: two mate-pair and two paired-end (see Table S   We used the programs Jellyfish2.1.4 [4] and bbmap [5], to perform k-mer counts on each sequencing library separately, for a range of k between 15 and 35. The program bbmap was initially used only to confirm the Jellyfish results, but proved more efficient by generating the same results, and even extending these over a higher range of k-mer depths (up to 100,000 with bbmap vs. up to 10,000 with Jellyfish).
Counts for k > 30 could only be completed with bbmap.

K-mer frequency spectra in Oncopeltus
To generate the k-mer frequency spectrum, the frequency of occurrence of k-mers in the dataset (also called depth or multiplicity) is plotted against the observed counts for each of the frequencies. (Note that there is no consensus in the naming of axes for the k-mer frequency plots, and because we are actually plotting the frequency of a frequency, either axis can be labeled as 'frequency'.) In an ideal dataset, with no sequencing errors and where all parts of the genome are represented equally, we expect a curve with one or more clear peaks, with the highest peak corresponding to the homozygous non-repetitive fraction of the genome. This peak would ideally be centered at the expected depth of coverage of the respective dataset (or more precisely at expected coverage -1/read length*k -1 ), because unique regions of the genome should be sampled on average as many times as the sequencing coverage [see 6,7,8].
In practice, k-mers from repetitive regions of the genome and k-mers containing sequencing errors can shift or change the shape of this curve.
The k-mer spectra obtained for our datasets did not resemble theoretical expectations [like in 9] in that they had very shallow or non-existent peaks. Excluding k-mers with counts <5, which represent mostly erroneous k-mers [10], allowed the visualization of small peaks for all but the 500-bp dataset. All k-mer spectra distributions were unimodal, with a long, slowly decreasing tail towards higher k-mer frequencies. Among the four datasets, the most prominent peaks were observed in the mate-pair libraries ( Figure S 2.2). The 500-bp dataset showed an almost monotonic exponential decay for all measured k-mer values, probably due to its low coverage (13%). However, the shape of the 500-bp dataset did not improve when combined with counts from the 180-bp dataset.
For each dataset, the position of the peak (frequency or depth at which the peak is centered) varied depending on the size of the k-mer counted: in all cases increasing k-mer lengths produced a shift of the peak towards lower k-mer frequencies (Figure S 2.3 shows the progression for the 180-bp library). This kind of shift can be due to poor data quality or bias in sequencing coverage [9,10]. In the case of sequencing errors, these become magnified by a factor of k because there will be k erroneous k-mers per error, and erroneous k-mers are expected to have very low frequencies [11]. This magnification will shift total k-mer counts towards very low frequencies and reduce the difference between this noisy region and the remaining curve. The second factor, sequencing coverage bias, is typical in amplification-based sequencing approaches due to variation in amplification efficiency across the genome.
This affects the probabilities of sampling k-mers in the same genomic frequency class, Figure S 2.2: Overlap of 17-mer spectra for the four Illumina libraries. The 8-10Kb mate pair library had the peak with the highest volume, but centered at a depth of 14, followed by the 3Kb peak at a depth of 20 and the 180bp peak centered at a depth of 23. The 500 bp library has a very shallow peak. All the peak depths were lower than the expected depth of coverage for the respective libraries (see Table S 2.2). Figure S 2.3: k-mer spectra for the 180 bp library and variable values of k. The display is limited to depth values from 4 to 100, to better visualize the 'peaks'. For increasing k-mer sizes, the peak becomes more pronounced and shifted towards lower k-mer frequencies, whereby the inflexion in the curve separating erroneous k-mers from the rest of the distribution occurs at higher counts. This means that, the larger the k-mer size, the less distinction there is between correct and erroneous k-mers. flattening the k-mer frequency curve and making it more difficult to observe a peak [10]. Our datasets had very good quality as inferred from Illumina quality scores, and erroneous k-mers are usually dealt with by excluding low frequency k-mers from the calculations. On the other hand, coverage bias is more difficult to deal with and requires complex modeling [see 10], which we did not do here. Unimodal distributions with peaks shifted towards lower depths were suggested to be characteristic for most animals except mammals and birds [12,13], but these previous analyses were limited to Drosophila and Apis mellifera for the insects.
The presence of only a single peak in all our k-mer spectra impeded the estimation of heterozygosity. Expectations from ideal simulated data for a diploid genome show that the k-mers corresponding to the heterozygous part of the genome would form a second peak at about half the expected depth of coverage of the given dataset. The ratio between these two peaks depends on the heterozygosity of the genome, therefore allowing its estimation [see 10]. For Oncopeltus we also expected a bimodal distribution because the individual sequenced was not inbred. However, several other arthropods sequenced as part of the i5K pilot project showed two distinct peaks in their k-mer spectra (S.C. Murali and S. Richards, manuscript in preparation), suggesting that the relative flatness of our k-mer spectra may indeed be due to specific characteristics of the Oncopeltus genome.

Genome size estimation
We estimated genome size from all obtained k-mer spectra using two formulas. One is based simply on k-mer numbers: total k-mer number/coverage at the peak (formula 1) [14], while the second uses nucleotide counts from the estimate-genome-size.pl script by J. Ryan (formula 2) (https://github.com/josephryan/estimate_genome_size.pl): For all estimations we excluded the low depth k-mers forming the left side sharp peak in the k-mer spectra curves; the threshold was set at the lowest turning point of each curve, which varied between depth values of 4 and 6. The estimates from our calculations (Table S 2.3) were in most cases too high compared to the genome size estimated by flow cytometry of 920 to 930 Mb (see Section 2.1.a). The best approximation was obtained from the 15-mer spectra, because this curve had the peak at the highest depth of coverage. This depth of coverage (or k-mer frequency) at the peak was the main factor affecting the estimations; it became smaller for larger values of k, resulting in an overestimation of genome size. As mentioned above, it is probably due to sequencing coverage bias that all our sequencing libraries had the peak of the k-mer spectra centered at much lower depth of coverage values than the library sequencing depth. Consequently, about one third of the genomic k-mers were indistinguishable from erroneous low frequency k-mers, precluding more accurate estimations from the k-mer frequency spectra. LGT candidates For Oncopeltus fasciatus, 20 lateral gene transfer (LGT) candidates from bacteria were predicted using computational methods and assessed by subsequent manual annotation. These have a blastn similarity score to prokaryotes <1e-10 or a bitscore >75. We then assessed these LGT candidates, by several criteria, including information on the possible prokaryotic source of the LGT, gene annotation, and expression support (  . Alanine racemase is an enzyme that catalyzes a structural conversion from L-alanine to D-alanine, with D-alanine being used in murein biosynthesis in bacteria. Although alanine racemase has been found in some marine invertebrates [15], the LGT found in Oncopeltus is phylogenetically embedded among alanine racemase genes from bacteria (based both on nucleotide and protein analyses: Figure   S   Whereas the two LGTs with potential roles in cell wall synthesis appear to be unique to Oncopeltus among the insects, an older LGT event led to the introduction of the cell wall degradation enzyme endo-1,4-beta-mannosidase [16] in the common ancestor of Oncopeltus and the stink bug Halyomorpha halys, a fellow member of the hemipteran infraorder Pentatomomorpha (Figure S 2.6, and see main text). Finally, although not detected by our DNA-based LGT pipeline, we also found in Oncopeltus the ancient lysozyme glycoside hydrolase (GH25) LGT of Wolbachia origin, which is present in other hemipterans [17]. GH25 (also known as muramidase) is implicated in antimicrobial defense, and encodes an enzyme that also breaks down the bacterial cell wall. Therefore, we have detected various LGTs in Oncopeltus that are implicated in bacterial or plant cell-wall metabolism, and range in age from relatively recent (Oncopeltus specific), to presence in a subset of Hemiptera, to an LGT of apparently ancient origin in the Hemiptera.

Potential bacterial scaffolds
In addition to LGT candidates, there were five candidate "contaminating" bacterial scaffolds identified. Typically we find scaffolds that are probably from bacteria and these most likely are from bacterial associates of the insect. Thus, they can be informative. The identified scaffolds have homology to Sphingomonas sanxanigenens  (919 to 3841 bp), with stretches of only 0.9 to 1.4 Kb each with strong blast similarity to the bacterial sequence. Furthermore, these scaffolds do not represent entire bacterial genomes, and each has highest homology to a different bacterial species. A caveat is that we cannot always rule out that these "bacterial" scaffolds are actually large LGTs. Future comparison of the depth of coverage on these scaffolds to depth of coverage in genomic scaffolds (single copy regions, not TEs) could clarify this, as bacterial contaminants are typically at different density from nuclear genes (sometimes much higher, sometimes much lower). Note that the overall low number of such "contaminating" scaffolds is consistent with the method of template preparation for sequencing, as DNA was prepared from dissected adults from which gut material was removed. This also further strengthens the likely validity of the strong LGT predictions.

Methods
Two different computational pipeline scripts were used to identify LGT candidates and contaminating bacterial scaffolds in the genome assembly.

Information on the "old" computational pipeline
The scaffold assemblies (Genome File Name: Ofas.contaminationfree.scaffolds, downloaded from the i5k FTP site on 6-10-15) were run through this pipeline [18], which compares sequences for matches to ~1000 different bacterial genomes and compares the blast matches to a set of up to 9 reference eukaryotic genomes. For O.
fasciatus, the animal database contained transcripts from the following animal genera: Anopheles, Drosophila, Xenopus, Tribolium, Daphnia, Mus, and Homo sapiens.
Scaffold outputs were then sorted into potential LGTs, contaminating bacterial scaffolds or likely conserved scaffolds. Sorting was based on length of the scaffold, length of bacterial matches across the scaffold, and difference score in bacterial blast match relative to eukaryotic match [18]. This pipeline focuses on the best blastn bacterial hit (highest e-value) on each scaffold, and therefore additional regions with bacterial similarity were found manually using NCBI blastn.

Information on the "new" computational pipeline
Since the old pipeline sorted scaffolds based on only the top prokaryotic hit information, there was concern that LGT candidates were being missed if they were not the best hit. The "new" computational pipeline breaks long scaffolds into 1000-bp intervals and searches each of them against the bacterial database. Any positive hits of the 1000-bp regions were then searched against the animal database. The bacterial database contained about 1000 bacterial species and was masked for low complexity regions using the NCBI Dustmasker function. The animal database for O. fasciatus contained transcripts from a representative from each of the following animal genera: Anopheles, Drosophila, Xenopus, Tribolium, Daphnia, Strongylocentrotus, Mus, Homo sapiens, Aplysia, Caenorhabditis, Hydra, Monosiga, and Acanthamoeba. The significance e-value cut-off used was <1e-5 for both the animal and bacterial hits.
Regions of bacterial similarity that fell from the end of one 1000-bp interval to the adjacent interval were joined. Only the putative LGT regions ≥100 bp and without any hits to the animal database were used in the final analysis.

Manual annotation
The candidate LGT outputs were then manually curated by the following basic steps.

Potential
LGT regions on the scaffolds were searched via blastn (NCBI) in the nr/nt database. If this indicated that the region is simply a conserved gene in insects and other metazoan organisms, it was noted as conserved and disregarded. If not, the region was searched via blastx to the nr/nt database to determine if it is a conserved gene or remains an LGT candidate. For LGT candidates, flanking genes were analyzed by both blastn/blastx of the flanking sequence and/or by observations of gene models and transcription models on the species-specific Apollo web browser (https://apollo.nal.usda.gov/oncfas/selectTrack.jsp). Outputs meeting the criteria of a potential LGT were labeled as an LGT candidate in the annotation comments.
After the OGS had been generated, further manual assessment was performed, including inspection of transcriptome and RNA-seq read evidence tracks in the genome browser, and by cross-referencing to 4 stage-specific libraries (embryo/maternal, nymph, adult female, adult male) if the LGT could be assigned to a given OGS gene model. Additional GenBank NCBI blastn and blastp analyses were also performed in light of finalized OGS gene models from community curation (Section 4, above).

Experimental validation of selected LGT candidates
PCR amplification from Oncopeltus genomic DNA (gDNA) template was used to validate five selected LGT candidates, with three biological replicates for template gDNA template. Amplification followed a published thermocycle program for LGT candidates [19] with the following modifications: annealing temperature of 54.4 ºC and a 2.5-minute extension step. Primers were designed against the genome assembly and chosen to flank the LGT region (sequences listed in Table S 2.4). The resulting amplicons (1.1-1.7 kb) were cloned, and two clones per LGT amplicon were Sanger sequenced and confirmed to fully encompass the predicted LGT sequence.

Contributors: Iris M. Vargas Jentzsch and Kristen A. Panfilio
The repeat content of the assembly was assessed with RepeatModeler [20] using default parameters, based on a species-specific repeat library generated de novo with RECON [21], RepeatScout [22], and Tandem Repeats Finder [23]. For comparative analysis, the same analysis was performed on the genome assemblies of two fellow nucleotides, the current Oncopeltus assembly is considerably fragmented. One of the main causes of fragmentation in genome assemblies is a high proportion of repeats in the assembly [24]. Thus, the analysis of repetitive content in a highly fragmented assembly is essentially flawed.
In an attempt to improve the Oncopeltus assembly, we generated additional We first attempted to generate a hybrid assembly from PacBio and Illumina reads, using ALLPATHS-LG. At the time of writing this paper we had no success in generating a hybrid PacBio-Illumina assembly, due to excessive requirements of RAM by the program (>500 Gb). Nevertheless, initial ALLPATHS-LG estimates of total repetitive content placed the value at 68% with these hybrid data.
Thus, we also used the PacBio subreads to perform gap-filling on the i5k scaffolds with PBJelly Version: 13.10.22 [26] with the following blastr parameters:  [27]. Therefore, there is something particularly challenging in our current Oncopeltus dataset, and we decided to do further analyses to check if any more information could be gained by including PacBio data into our assembly.
Hereafter we will call the gap-filled Illumina assembly the 'PBJelly assembly'.
For an initial quality assessment on the PBJelly assembly, we compared it to the Illumina assembly with respect to presence, completeness, and copy number for benchmarking universal single-copy orthologs of protein coding genes (BUSCO, v. 3, [28,29]) based on expectations across the Insecta. The two assemblies are highly similar for these metrics, with slight improvements for completeness at the expense of only a minor increase in duplicates in the PBJelly assembly ( and notably lower than our ALLPATHS-LG estimate of 58% (Figure S 2.7 B). The absolute coverage, however, is almost double that in the Cimex assembly and more than 2.5 times that in the pea aphid assembly (see main text Fig. 5b). The majority of the repeats found by RepeatModeler were in the "unknown" category, probably because they were too short or incomplete to be identified unambiguously. Lastly, although our approach to identify repeats in Oncopeltus was a combination of de novo and homology-based prediction approaches, sensitivity could be enhanced by further manual curation of the de novo library [31], as was done for the pea aphid [32].   repeat categories in Oncopeltus, compared between two assembly versions. "Illumina i5K" is the current official assembly based solely on Illumina short read data. "PBJelly" is the same i5K assembly after gap filling with PacBio reads (see also Supplemental Note 2.3). Even after gap filling, it was not possible to identify large tandem repeat structures of the satellites category in these analyses with RepeatModeler. (B) Total repetitive content estimations for Oncopeltus based on different genome assembly versions. In addition to the assemblies used in panel (a), a third dataset, "ALLPATHS-LG", represents an estimation derived from initial attempts to generate a hybrid assembly with Illumina and PacBio reads.

Contributors: Christopher J. Holmes and Joshua B. Benoit
To gain insight into stage-and sex-specific enrichment in gene expression, we For Oncopeltus, we further used our known ("i5K") RNA-seq datasets for an adult male, an adult female, and mixed-instar nymphs to assess a previously published dataset for an Oncopeltus adult of unspecified sex ("Andolfatto", [33]). Genes expressed in the Andolfatto set are consistent with this sample being most similar to the i5K male sample. Overall, the Andolfatto set had nearly 50% fewer differentially expressed genes in relation to the i5K male RNA-seq set when compared to either the i5K female or i5K nymph RNA-seq sets. Also, sperm-specific genes, such as serinethreonine kinases, showed a noticeably higher level of expression in the Andolfatto set even compared to the male-specific i5K set. However, vitellogenin-specific genes are also detectable in the Andolfatto set, albeit at lower levels compared to the female-specific i5K set. Thus, it appears that the Andolfatto set is mainly composed of male-specific genes, but does represent female-and nymph-specific genes as well.

Methods
Sex-specific and developmental stage-specific RNA-seq analyses were conducted according to published methods [34,35]

Contributors: Dan S.T. Hughes and Stephen Richards
Oncopeltus fasciatus was among 28 i5K pilot species for which genome assemblies were subjected to automatic gene annotation using a Maker 2.0 annotation pipeline tuned specifically for arthropods. The pipeline is designed to be systematic, providing a single consistent procedure for the species in the pilot study and scalable to handle hundreds of genome assemblies, using both protein and RNA-seq evidence to guide gene models, and targeted to utilize extant information on arthropod gene sets.
The core of the pipeline was a Maker v2.28 [36] instance, modified slightly to enable efficient running on our computational resources. The genome assembly was first subjected to de novo repeat prediction (RepeatModeler 1.0.8 [20]) and CEGMA analysis to generate gene models for initial training of the ab initio gene predictors.  Special care was required for curating genes that were only partially predicted in the assembly. Ab initio gene prediction is difficult across big gaps (when parts of the gene are on different contigs), and a complete gene prediction is often not possible if parts of the same gene are on different scaffolds. Each part of a gene split across scaffolds may have its own automated model, or some parts may lack models altogether. Split gene models are to be expected in fragmented genome assemblies, because sequence gaps make gene prediction difficult [44]. Furthermore, having multiple good blast hits for the same gene on different scaffolds could indeed be due to the gene being split across multiple scaffolds, or to either gene duplication or conserved sequence within a gene family. If the automated model to be curated only represented part of the query sequence, the search was repeated with only the missing part of the query, in order to focus the blast search on relevant local alignment regions. If this query sequence was very short, turning off the low complexity filter in the blast options increased the probability of getting a hit. For genes split across multiple scaffolds, the model for each part was checked and documented for exon start and end phases at the break points, to make sure that the proper reading frame is obtained when concatenating all parts into a single model. Lastly, split models were documented as such both in the metadata and in the model name (labeled with the suffix '-part x of y'). As the gff3 format does not provide a way to specify this information other than in the comments section, the official gene set gff files contains multiple models per split gene, and the details for putting these parts together were documented separately (see Table S 4.2).

Community curation and generating the official gene set
After the manual curation stage, the official gene set (OGS) was created by merging the computationally predicted gene set and the manual curated models. This was done using a 'patch' build system that uses heuristics to merge manual and automated gene predictions [45], whereby all automated models that overlapped on the same strand with manually curated models were replaced. This overlap was restricted to coding exons, and therefore it was important to take care that all reading frames were set correctly: if the reading frame of an upstream exon was incorrectly set and generated a premature stop codon, the rest of the gene model downstream would be registered as untranslated region and all intersecting automated annotations would be kept in the OGS. Indeed, several of these cases were registered in the OGS  (205) is far greater than the number of gene models resulting from split CDS actions (30). These numbers suggest that the original Maker dataset may be an overestimate of the actual number of genes from certain gene families, and that gene length was also underestimated. Indeed, the mean gene locus size of the OGS v1.1 is 12,985 bp (median: 8,819 bp), which is slightly higher than in the original automated set, because manually annotated genes were longer (mean: 20,794 bp; median: 11,326 bp). In an extreme example, the Oncopeltus orthologue of hemocytin (also known as hemolectin), which encodes a conserved carbohydrate binding protein of 3667 amino acids, was split across three scaffolds and predicted in ten separate automatic annotation models.
In addition, many more gene models were extended than reduced (376 vs. These trends may be due to the fragmented nature of the assemblygenome assembly fragmentation is often associated with a higher predicted gene number and fewer exons per gene [44]. Furthermore, many gene models were newly added in the curation process (285), 70% of which are chemoreceptors (gustatory receptors, ionotropic receptors, and odorant receptors, Table S1; chemoreceptors represented only 4% of all curated genes), which are fairly small, rapidly evolving, and often lack expression support, making them difficult to predict computationally [see also . 46].

Developmental regulation: transcription factors and signaling pathways
One of the main reasons for choosing to sequence the Oncopeltus genome was due to its status as an  It is already known that several key genes involved in axis formation in Drosophila are not conserved in other insects [47]. In other cases apparent gene absence may be due to genome assembly limitations, and readers are encouraged to also examine available transcriptomic resources (summarized in main text Fig. 2; [48]).

5.1.a Anterior-posterior body axis: terminal patterning system and segmentation
The terminal patterning system has been previously studied in Oncopeltus [49], and most of the relevant genes have been identified there. The genome analysis revealed a duplication of Torso-like, which encodes a perforin-like protein. Torso-like is also duplicated in the aphid genome, with one quite derived copy (Torso-like related), but the Oncopeltus duplications are independent of the aphid ones (based on Bayesian phylogenetics), and less derived. These copies have been named Torso-like 1 and Torso-like 2 to reflect a recent duplication with similar copies. The genome analysis did not recover a copy of trunk, which is involved in terminal patterning in Drosophila. However, we have recovered a copy of the closely related PTTH, which controls developmental timing of juvenile stages. In aphids and crustaceans, genes similar to trunk/PTTH are found with more similarity to vertebrate and lophotrochozoan noggin. A representative of this class of genes is also present in Oncopeltus (see below in Section 5.1.i). We have confirmed the absence of torso, reported to be missing in Weisbrod, et al. [49]. This gene encodes a receptor tyrosine kinase, of which many are present in the genome. The most closely related receptor tyrosine kinase we have found is most similar to 'Neurospecific receptor kinase'. This is a surprising result, since in all species studied so far where there is a ligand (either PTTH or trunk) there is also a receptor similar to Torso [50]. These findings suggest that Oncopeltus uses PTTH to control developmental timing (as this appears to be an ancestral trait in insects), but it appears as though this pathway is not involved in terminal patterning [49,50].
The genes known from the Drosophila segmentation cascade (maternal, gap, and pair-rule members) were examined in the Oncopeltus genome. With the exception of the cyclorrhaphan-restricted bicoid, homologs were found for all inspected candidate genes, including two copies each of knirps (previously characterized in [51]) and of paired class genes, and with expression support for two isoforms each of nanos and Giant. Interestingly, the gene reported as engrailed and used in several reports of Oncopeltus development [49,[52][53][54] turns out to be orthologous to invected, which encodes a diagnostic "RS-motif" from a single, small internal exon [55], and which we could validate empirically by amplification from cDNA. The actual engrailed ortholog occurs in a tail-to-tail orientation on the same scaffold.
Lastly, all three odd-like genes (odd-skipped (odd), brother of odd with entrails limited (bowl), and sister of odd and bowl (sob)) were identified in the

Contributors: Kristen A. Panfilio and Iris M. Vargas Jentzsch
Homeodomain transcription factors are a protein superfamily with diverse roles in developmental regulation, with the eponymous Hox genes representing a textbook case of metazoan-wide conservation in gene cluster organization, protein sequence, and role in tissue specification along the anterior-posterior body axis [56]. Between pipeline analyses of OGS v1.1 (see main text and Section 6.2, below) and manual curation, we have identified 96 genes encoding homeodomain transcription factors, of which 39 have been manually curated for their relevance to specific biological roles covered here and in other manual curation sections. For the Hox genes, we have found and annotated complete, single copy orthologs for all ten expected genes in Oncopeltus (Table S 5 The size of the complete Hox cluster, assuming a direct concatenation of these 11 scaffolds, is 4.2 Mb. This is relatively large compared to the better-assembled clusters we annotated in other i5K genomes: in both the bed bug (Cimex lectularius, [19]) and the Asian longhorned beetle (Anoplophora glabripennis, [46]) the cluster spans 3.5 Mb, assembled onto only one or two scaffolds, respectively. Both species have genome sizes comparable to Oncopeltus (926 Mb), from 865 Mb in the bed bug to 976 Mb in the Asian longhorned beetle. The increase in size of the Oncopeltus cluster is largely due to an increase in the length of intronic and intergenic regions, including gaps in the assembly. In contrast, protein sizes are marginally smaller (see Table S  For the functionally diverged Hox gene fushi tarazu (ftz), a single transcript sequence was identified in a previous transcriptome [48]. Based on empirical analyses by Yong Lu (group of Leslie Pick, unpublished data), two transcript isoforms were annotated for ftz in the current genome assembly. However, while the presumptive homeodomain open reading frame could be found, neither isoform appears to encode a complete transcript for a functional protein, and these two isoforms may rather represent degradation to pseudogene status.
Beyond the Hox genes, we were also able to identify clear orthologs in conserved syntenic pairs for the aforementioned engrailed and invected paralogs (see     [57] and Anoplophora glabripennis (Agla) [46], and the bugs Oncopeltus fasciatus (Ofas) and Cimex lectularius (Clec) [19]. Percent protein size change is relative to Tribolium.

Contributors: Jan Seibert and Kristen A. Panfilio
The transcription factors araucan (ara), caupolican (caup), and mirror (mirr) belong to the TALE superclass of homeodomain proteins [58] and form the Iroquois-Complex (Iro-C) in the fruit fly Drosophila melanogaster [59][60][61]. It is already known that ara and caup arose due to a tandem duplication of the gene iro in the drosophilid lineage [62], while the Iro-C itself is ancestral within crustaceans and insects. In vertebrates on the other hand, one can find 2-4 Irx clusters, with up to three genes per cluster. More basally branching metazoans like the cnidarians or the placozoans, but also nematodes, which as fellow Ecdysozoa are much closer to insects, have only a single Irx gene [63]. Given these deep, lineage-specific complements of Irx genes and clusters, they provide a perfect target to assay the quality of the Oncopeltus genome.
In most bilaterians, but not in vertebrates, this synteny can be extended to the ankyrin repeat-containing sosondowah (sowah) gene, which is known to be associated with the Iro-C [61,64]. In all so far investigated insect species sowah is located 5 to the Iro-C and has an opposite reading direction [63]. Our results not only support the high degree of conservation of the Iro-C and its associated genes in Oncopeltus fasciatus, they also strengthen the quality of our assembly in general with the expected linage of the Iro-C gene pair.

Methods
Iro-C and Irx protein sequences were retrieved using TBLASTN and BLASTP algorithms on the corresponding databases/genome browsers (Table S 5    Notably the support values for this grouping and already for those in the iroqouis branch starting at N. vitripennis are quite low.

Contributors: Thorsten Horn and Kristen A. Panfilio
We found 5 T-box genes in Oncopeltus based on homology to Drosophila melanogaster, Tribolium castaneum and additional species where available (Figure S 5.5). This is fewer than in Drosophila (8) and Tribolium (6). However the gene Dorsocross has three paralogs in Drosophila (Dmel-Doc1-3), which stem from a recent duplication in the Drosophila lineage and possess similar expression patterns and functions [67]. Therefore, we would only expect 6 T-box genes in Oncopeltus, similar to Tribolium. The only missing T-box gene in Oncopeltus is optomotor-blindrelated-gene-1 (org-1). We did find a partial duplication of Ofas-optomotor-blind

Contributors: Yong Lu and Leslie Pick
The   family of NRs appears to be highly conserved in hemi-and holometabolous insects.
Overall, of the 16 nuclear receptors that were found in the Oncopeltus genome (

Contributors: Lena Sachs and Siegfried Roth
The BMP pathway consists of extracellular ligands and their modulators, of transmembrane receptors with intracellular serine/threonine kinase activity and of cytoplasmic signal transducers, which relay the signal to the nucleus and regulate target genes [89]. All core components of the BMP signal pathway were found in the Oncopeltus genome (for functional analysis of key components see [90] has been well studied in vertebrates and acts as a negative regulator of BMP signaling [92]. BAMBI has been lost in the lineage leading to Drosophila, however it is present in Tribolium and Nasonia [93,94]. We found a homolog in Oncopeltus (incomplete gene model). Homologs also exist in Halyomorpha and Cimex. Modulation of the BMP receptors and SMADs occurs also at the level of regulated protein degradation.
In Drosophila the E3 ligase Smurf regulates ubiquitination and proteolysis of the BMP receptor Thickveins and/or pMAD [95,96]. A Smurf ortholog was found in Oncopeltus.
Oncopeltus possesses a typical set of extracellular modulators of BMP signaling, including: a homolog of the BMP inhibitor Short gastrulation (Sog); a homolog of Tolloid (Tld), the protease that cleaves Sog to release the BMPs from Sog-BMP complexes; Crossveinless-1 (Cv-1, also known as Twisted gastrulation (Tsg)); Crossveinless-2 (Cv-2); and Pentagone [89,97]. However, no BMP modulators of the DAN family were discovered in the genome, although two are present in Tribolium, suggesting a lineage-specific loss [92,94]. Similarly, no admp ortholog could be identified, similar to the situation in pea aphids, although one is present in the holometabolous wasp Nasonia vitripennis [93].
Interestingly, Noggin, a secreted BMP inhibitor that so far has not been identified in any of the holometabolous insect genomes [50], is present in Oncopeltus.
Within the Hemiptera Noggin homologs show considerable sequence divergence, as the Noggin proteins of the bed bug and pea aphid show only 50% and 32% similarity to Oncopeltus Noggin, respectively.

Contributors: Yen-Ta Chen and Siegfried Roth
The Toll pathway is involved in innate immunity, convergent extension and dorsal- We identified six TLRs in Oncopeltus. Toll1, Toll6, Toll7, Toll8, Toll10 were identified in both the genome and transcriptome [48] while Toll9 was only identified in the transcriptome. Toll6, Toll8 and Toll10 belong to the Loto clade of segmentally expressed Tolls [101]. Toll1 is involved in DV patterning and probably also in innate immunity [90].
As for the downstream cytoplasmic signal transducers we found one representative each for Myd88, Pelle and Tube-like kinase. The presence of the latter further supports the idea that Tube proteins fused to a kinase domain represent the ancestral state for insects, and consequently that the Tube proteins of Hymenoptera and higher Diptera lacking a kinase domain resulted from a secondary loss [103].
Surprisingly, we identified six paralogs for I-kappaB/cactus scattered throughout the genome ( Figure S 5.7), four of which have been shown to be expressed during blastoderm stages [90].  Nv-Cact2 Nv-Cact1 Am-Cact3 Am-Cact1 Am-Cact2 Aa-Cact

Of-Cact1
Of-Cact5 OfCact-6 Of-Cact4 Of-Cact2 Of-Cact3 suggesting that they have arisen by tandem duplication. Both dorsal genes contribute to DV patterning, albeit to a different degree [90].

Contributors: Chris G.C. Jacobs, Yen-Ta Chen, Maurijn van der Zee
We were able to annotate 99 immune genes in the Oncopeltus genome. This number includes the Toll signaling components described in Section 5.1.g. Additionally, fifteen Defensins, seven lysozymes and five Hemiptericins were annotated, of which some have been previously identified [108]. Interestingly, no Cecropins, Attacins or Thaumatins were found. In addition, we found nine new potential antimicrobial However, classical cloning with degenerate primers was necessary to isolate the Oncopeltus ortholog. As for FADD and Tab2, gene models were absent from the genome assembly and coding sequences were even not found by tBLASTn, possibly because of extremely short exons (see also main text results and discussion on gene structure evolution and short exons within the Hemiptera). The sequences were finally found by de novo RNA-seq assembly (submitted manuscript cited above). IMD is also present in the closely related leaf hoppers Nilaparvata [109] and Homalodisca, suggesting that IMD was present in an ancestor of all Paraneoptera. Loss of IMD was reported from Acyrthosiphon [32]. IMD was also not found in the genomes of Sternorrhyncha is assumed, as found for instance when using the BI-DNA method (see Figure S 5.8, [110]). However, the most accepted phylogeny suggests multiple losses. These losses could mean that the Toll pathway is not only able to upregulate effector genes in response to Gram-positive bacteria, but also in response to Gramnegative bacteria. Interestingly, response to these distinct bacterial inputs is also not strict in Tribolium, and to some extent also not in Drosophila [111][112][113][114].  Previous work has indicated that conserved gene complexes are rare in insect genomes, and that only three are conserved over significant evolutionary time. One of these is the Hox complex (Section 5.1.b).

5.1.i Notch, Hedgehog, and Torso RTK pathways
A second conserved complex is the Runt complex. This set of four Runt domain containing genes is a feature of insect genomes, but has not yet been found in crustaceans [20]. Ancestrally (as seen in aphids and Pediculus) the complex contains Other gene families analyzed include the sp genes, where we have identified orthologs of sp4 and sp8/9 and confirmed the absence of sp5/buttonhead [116,117].  [50], including orthologs identified from the Oncopeltus fasciatus genome. Using a combination of methods, including BLAST homology searches and screening annotated Oncopeltus proteins with a custom HMM motif using the HMMER suite of programs, two Noggin-like/PTTH/trunk family members were identified in Oncopeltus. Bayesian phylogeny indicates one of these orthologs is most closely related to other insect PTTH proteins, and the other ortholog is a member of the Noggin-like family. Posterior probabilities are shown at nodes, the tree is rooted with the Noggin-like sequence identified in the Amphimedon genome. Multiple alignments were carried out using ClustalX [118]; the phylogeny was initially carried out under mixed models, and then with the most appropriate specific model using MrBayes. The Monte Carlo Markov Chain search was run with four chains over 1,000,000 generations with trees sampled every 1000 generations. The first 25% trees were discarded as 'burn-in'.

Contributors: Iris M. Vargas Jentzsch and Kristen A. Panfilio
The Wnt pathway is a signal transduction pathway with fundamental regulatory roles in embryonic development in all metazoans. The emergence of several gene families of both Wnt ligands and Frizzled receptors allowed the evolution of complex combinatorial interactions with multiple layers of regulation [119]. Wnt signaling affects cell proliferation and migration, as well as segment polarity, patterning and addition in most arthropods [120]. Here we strived to identify and curate the automated models corresponding to the main components of the Wnt signaling pathway.

Methods
The protein sequences for the wnt ligands as well as receptors and downstream components (armadillo/beta-catenin, dishevelled, frizzled, arrow, axin, shaggy/GSK-

3) from Drosophila melanogaster, Tribolium castaneum and Acyrthosiphon pisum
were retrieved from NCBI (we excluded accessions named as 'predicted' to avoid propagating errors from automated annotations). Using these as queries, we performed tblastn searches on the Oncopeltus scaffolds with a cutoff maximum evalue of 1e -10 . Hits from all species together were ordered by scaffold and start position, and for each group of overlapping or closely adjacent hits, the putative gene name was identified by blasting back the hit sequence against Arthropoda proteins in GenBank. The query sequences with the best hits (lowest e-value) for each group were then used to identify the model to be curated, by doing a tblastn search in the Oncopeltus scaffolds from the blast instance at the National Agricultural Library where the corresponding automated annotation models were edited. Homology, intron/exon boundary assessments, and protein sequence completeness were identified by manual inspection and correction of protein alignments generated with Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/), and subsequent phylogenetic analyses at http://www.phylogeny.fr/ [42,43]. For this, we used all query sequences, and included additional orthologs from the crustacean Daphnia pulex and the myriapod Glomeris marginata. Potential gene duplications were also confirmed during this process.
The numbering for wnt and fz orthologs was assigned based on the corresponding vertebrate homolog (the naming of Drosophila orthologs was changed accordingly in phylogenetic analyses).

Results
A total of 25 models were curated for the main Wnt signaling genes on the Oncopeltus assembly (Table S 5 assembly.
In contrast, the tandem duplication of wnt8 in Oncopeltus seems to be unique, as a single wnt8 ortholog was identified in other hemipteroid species, including the bed bug Cimex lectularius. The Oncopeltus wnt8 paralogs are both expressed in the maternal/embryonic transcriptome [48], with additional expression support for wnt8b in the previously published adult (male) RNA-seq library ( [33]; see also main text Fig. 2b).
The Oncopeltus Wnt ligand repertoirewingless/wnt1, wnt5, wnt7, wnt8a and wnt8b, wnt10 and wntAis similar to the one found in the pea aphid (Acyrthosiphon pisum): wingless/wnt1, wnt5, wnt7, wnt11, wnt16 and wntA [87]. In comparison, Drosophila and Tribolium have 7 and 9 Wnt subfamilies, respectively. This supports observations of a reduction in the ligand repertoire in insects compared to the 12 Wnt subfamilies inferred to have been present the last common ancestor of all arthropods [122]. Furthermore, members of the Hemiptera seem to have the fewest Wnt gene families reported in insects, with some of these losses perhaps having occurred relatively recently and independently in this clade. Nevertheless, assessments of gene absence need to be done with caution when dealing with draft assemblies from second generation sequencing, which is the case for most recently published genomes.
All Wnt ligand models were found isolated on different scaffolds with the exception of the wnt8 paralogs, and wingless and wnt10. The latter two were clustered (on Scaffold 926) in the same transcriptional orientation and without other intervening genes, which is also the case in Cimex. This gene arrangement was also observed in Tribolium castaneum and Drosophila melanogaster, but with the wnt6 locus between the two genes [123], reflecting the ancient arrangement of wnt genes in metazoans [124]. Wnt6 was not found in the Oncopeltus assembly, and we also could not find evidence of wnt6 in Cimex lectularius. This corroborates the postulated evolutionary loss of wnt6 in Hemiptera [125], where the absence of wnt6 was correlated with the absence of maxillary palps in insects.
Three models were curated for the frizzled (fz) transmembrane receptor families: two isoforms for frizzled, and one frizzled-2. These are only two of the four ancient fz families expected to have been present in the common ancestor of arthropods: fz, fz2, fz3, fz4 [126]. The loss of fz4 was also observed in Acyrthosiphon pisum [87].   [129,130]. Similarly, Tribolium also requires Wg signaling for larval limb development [131]. In contrast, Wg signaling does not appear to be required for appendage patterning in Oncopeltus [132]. Thus, the earliest stages of appendage formation likely differ between hemimetabolous insects and holometabolous insects. Once the imaginal discs are formed, Hedgehog (Hh) signaling activates the expression of Wg and Decapentaplegic (Dpp), which interact to pattern the anterior-posterior and dorsal-ventral axes [133]. Oncopeltus Dpp has already been identified [132], and we have identified components of the Hedgehog pathway in the Oncopeltus genome. Studying these signaling pathways should provide interesting insights into the early development of appendages in hemimetabolous insects.

5.1.k Appendage patterning
In the leg imaginal discs, the regional patterning in the proximal-distal axis involves gradual demarcation of the distinct regions of the limbs. At first, the imaginal disc expresses two proteins, Distal-less (Dll) in the presumptive distal portion and homothorax (hth) in the proximal portion [134][135][136]. Hth expression leads to the nuclear localization of the Extradenticle (Exd) protein [137]. Oncopeltus Dll and hth have already been identified by Angelini and Kaufman [138] and appear to play similar roles in Oncopeltus and Drosophila legs. We have also identified exd in the genome. At the intersection of Dll and Hth, Dachshund is activated and patterns the medial portion of the developing leg in both Drosophila and Oncopeltus [138].
Thus, the regional patterning mechanisms appear to be generally conserved between In Drosophila, the imaginal discs adopt wing/haltere identity in response to the expressions of the wing selector genes vestigial (vg) and scalloped (sd), which encode proteins that form a heterodimeric complex [141,142]. Loss of vg expression leads to cell death in the wing blade [142], and loss of sd also leads to cell death and loss of cell proliferation in the wing pouch [143,144]

Jones, Hsiao-ling Lu
Germline development in the Metazoa is controlled by a relatively small number of conserved genes. We searched the Oncopeltus fasciatus genome for these genes Drosophila orb is involved in axis specification as well as germ cell development, and the functional significance of orb duplication in Oncopeltus is unknown. We did not identify any ortholog of the oskar gene, which has not to date been identified in any other hemipteran species and has likely been lost from this lineage of insects.

Contributors: Yi-min Hsiao, Hsiao-ling Lu, Chun-che Chang
We identified twenty-one Drosophila homologs of eye developmental genes in the milkweed bug Oncopeltus fasciatus (Table S 5

Abstract
The term "bristle" includes various hair-like structures with different functions, including mechanosensory and chemosensory hairs that are in constant contact with substrates and air, allowing the insect to sense its surrounding environment. For this reason, many genes involved in bristle development have been previously described as playing a role in neural development in Drosophila [152,153]. In this study we annotated and analyzed 88 genes known to be involved in bristle development. Our results show an overall high conservation of protein sequence compared with Drosophila despite a higher intron number.

Results and discussion
The most studied role of insect bristles is perhaps their function as sensory organs for detecting various environmental stimuli. During development, each bristle is built from a small cluster of specialized cells including sensory neurons and support cells [154]. The shaft of the bristle extends from a single cell primarily via cytoskeletal arrangements [155]. In the fly Drosophila, the development of these bristles is quite well described and is regulated by a set of conserved developmental genes [156]. QTL studies have uncovered dozens of candidate genes and regions linked to variation in bristle density and morphology [152]. Some of this variation is also attributable to changes in non-coding sequences of a number of conserved developmental genes such as the achaete-scute complex [157,158]. Based on fly genetics, we established and annotated a list of 105 genes known to be involved in neurogenesis and bristle development, as well as in variation in bristle number and density ( [152,153]; Oncopeltus compared with their Drosophila homologs suggests that in many cases this reduced size is due to missing exons. As neither sequence similarity nor RNA-seq data available support the presence of additional exons, a possible explanation for this absence could be a technical artifact due to problems in the assembly process or the impossibility to close all the gaps in the genome.

Contributors: Deniz Erezyilmaz and Yuichiro Suzuki
Although the milkweed bug has been a key research model for hemimetabolous endocrine studies since the 1960's [160][161][162][163][164], only a handful of genes that regulate the progression through postembryonic development have been cloned from Oncopeltus since that time [53,165]. We therefore searched for genes that are involved in regulation of the molt cycle, cuticle identity, or ecdysis (see also manuscript main text).
For the cytochrome P450 enzymes of the Halloween family, which synthesize the ecdysteroid hormones that trigger the molting cycle, we have found all of the key P450 genes involved in ecdysone biosynthesis, including spook, phantom, disembodied and shadow [166]. We also identified shade, which encodes a P450 enzyme that converts ecdysone to its active form, 20-hydroxyecdysone, at target tissues [167].
Pioneering studies in Drosophila of the transcriptional response to ecdysone [53], ultraspiracle (usp), a homolog of the mammalian Retinoid-X-receptor (RXR), and its heterodimeric binding partner, the ecdysone receptor (EcR). Another early gene with stage-specific expression, the nuclear receptor gene E93, is also present in the Oncopeltus genome. This gene was recently shown to be required for adult metamorphosis in other hemi-and holometabolous insects [169]. Among the 'delayed early genes', which are known targets of EcR/USP heterodimers in Drosophila [84,168,170,171], we additionally identified the nuclear receptors E78, HR3, HR4, βftz-f1, HR39. In Drosophila, the hormone receptor HR4 [172] represses early genes, but activates expression of βftz-f1, a midprepupal gene. βFtz-F1, in turn, is required for optimal expression of early genes at the next molt in flies [173]. Finally, we discovered an ortholog encoding HR38, an orphan receptor that has been shown to heterodimerize with USP to mediate ecdysone signaling in Drosophila without directly binding ecdysteroids [174].
Ecdysis is driven by eclosion hormone (EH; [175]), ecdysis triggering hormone (ETH; [176]), crustacean cardioactive peptide (CCAP; [177]), and bursicon [178]. We have found genes encoding orthologs of EH, ETH, and the ETH receptor in the Oncopeltus genome. We were also able to identify a portion of the ccap ortholog in the Oncopeltus genome, and its putative ortholog encoding its receptor was represented in transcriptomic data.

Contributors: Andrew J. Rosendale, Joshua B. Benoit, Yuichiro Suzuki
Changes in the expression of specific cuticle proteins have been associated with increased stress tolerance and insecticide resistance [179,180]. We identified 173 putative cuticle proteins using sequence motifs established by Willis [179,181,182] from the milkweed bug genome (Table S 5  1 Sequences that scored above the assigned cutoffs for the RR-1 and RR-2 models were classified as the corresponding type, whereas sequences with scores below the assigned cutoffs but above 0 were characterized as "unclassified" (for more information, see [181]).
Melanization of the cuticle has been suggested as critical in the prevention of excessive water loss [183]. Furthermore, understanding the physiological and molecular regulation of pigmentation synthesis has important implications for understanding the evolution of aposematic (warning) coloration, and we therefore analyzed components of the pathways responsible for the main color elements. We identified key genes associated with melanization and red pigment production for Oncopeltus. These include genes encoding Tyrosine hydrolase [184], Yellow [185], Dopa decarboxylase [184], Ebony [186], and at least two Phenol oxidases ( Figure S 5.15). RNAi studies of these genes show that these genes play critical roles in melanin synthesis [187,188].
In addition, the Oncopeltus abdomen and eyes also produce ommochrome pigments [189]. The regulation of ommochrome biosynthesis has been well characterized in Drosophila eye development. The key genes involved in this process (vermilion, arylformamidase or kynurenine formamidase, cinnabar, white and scarlet) were identified in the Oncopeltus genome. Thus, most, if not all, of the key enzymes involved in ommochrome synthesis appear to be present in the Oncopeltus genome.

Antioxidant genes
Insects possess a suite of antioxidant proteins that prevent oxidative damage associated with various physiological processes. In Oncopeltus, 22 genes associated with the tolerance of oxidative stress were identified. Compared to other insects, the antioxidant system of Oncopeltus was well conserved; however, the number of catalase paralogs was greater in Oncopeltus than in insects such as Drosophila and Tribolium (Table S 5.13). In Oncopeltus three superoxide dismutases (including Cu/Zn and Mn/Fe SOD) catalyze superoxide to H2O2 and four catalases are present to convert H2O2 to water and oxygen. A total of nine genes are involved in the reduction of H2O2, including five peroxidoxins and four thioredoxin peroxidases. The thioredoxin system, responsible for maintaining proteins in a reduced state and scavenging reactive oxygen species, includes two genes for thioredoxin reductase.
One dihydrolipoamide dehydrogenase scavenges nitric oxide and one gene for glutathione peroxidase catalyzes the breakdown of H2O2 and hydroperoxides. Both dual oxidase and nitric oxide synthase (one gene each) are involved in immune response. In Oncopeltus, antioxidants not only play an important role in normal metabolic processes, but also in the survival of oxidative stress from xenobiotic factors [190].

Aquaporin genes
Aquaporins (AQPs) impact organismal stress tolerance, specifically under periods of dehydration and cold stress, by the regulation of cellular water levels [191]. We have identified seven aquaporin genes for the milkweed bug that include those that encode for a Drosophila-integral protein (Drip), AQP2, AQP4 (two sequences), AQP5, AQP6 and Bib (Table S 5.15, Figure S 5.16). This number falls within the range of most insects (6)(7)(8) and Oncopeltus has members of each group previously identified from insects [191]. The number of genes is identical to those recovered from the bed bug, Cimex lectularius [191].

Heat shock protein genes
Heat shock proteins (Hsps) have been documented as key in relation to stress resistance under a multitude of conditions [192]. In general, there were no major expansions or retractions associated with Hsps for Oncopeltus when compared to other insects (Table S 5.15).

Contributors: Iris M. Vargas Jentzsch, Yuichiro Suzuki, Kristen A. Panfilio
Cytochrome P450 genes encode one of the largest and most diverse enzyme families, and they can be found in the genomes of all domains of life [193]. Among an ample range of functions, P450 enzymes play a central role in the metabolism of xenobiotics and the production of hormones (see also Section 5.2.b, above). Despite their high sequence variability and substrate specificity, the overall structure of cytochrome P450 genes is highly conserved [193,194].
As a starting point for our P450 curation, we took a sample of ten curated P450s in the bed bug Cimex lectularius [19], which represented all major clades in a phylogenetic tree these proteins. The selected gene models were used as queries for blastn searches into the O. fasciatus scaffolds, and we annotated 17 gene models from the corresponding matches (Table S 5

Contributors: Lucila Traverso and Rolando Rivera Pomar
In

Contributors: Lucila Traverso and Rolando Rivera Pomar
Neuropeptides are cell-to-cell signaling molecules that act as hormones, neurotransmitters, and/or neuromodulators of feeding, behavior or basic physiological processes. By homology search, we found 31 genes encoding at least 52 splicing variants of neuropeptides (see Table S  The Adipokinetic hormone (AKH) was not found, although its receptor. We cannot rule out gaps in the genome sequence, or highly diverging sequences that were not detected by our homology search.

Contributors: Markus Friedrich, Jeffery W. Jones, Megan Porter
The milkweed bug has been one of the earliest model systems to study basic mechanisms of pattern formation during insect compound eye development [195].
The milkweed bug compound eye has also been used for a variety of physiological analyses [196], but more detailed analyses of spectral sensitivities across the compound eye exist for other hemipteran species, most notably the water striders and, most recently, cicadas [197][198][199]. Together with the first molecular study of opsin diversity in the vetch aphid Megoura viciae [200], these studies produced direct includes a representative of the enigmatic but deeply conserved Rh7 opsin subfamily that has been described in D. melanogaster but not yet been functionally characterized [201]. The second extra-retinal opsin is a member of the recently characterized arthropsin subfamily, which was first reported from the Daphnia genome but has since then been detected in other arthropods including insects such as the pea aphid [202][203][204].

Contributor: Hugh M. Robertson
The gustatory receptor (GR) family of seven-transmembrane proteins in insects mediates most of insect gustation (e.g., [208,209]), as well as some aspects of olfaction, for example, the carbon dioxide receptors in flies [210][211][212][213]. The GR family ranges in size from a low of 6 genes encoding 8 proteins in the human body louse [214] and 10 genes in the honey bee Apis mellifera [215] to 215 genes encoding 245 proteins in the flour beetle Tribolium castaneum [57]. The other sequenced hemipteroid insects have intermediate sized families, with 77 genes in the pea aphid Acyrthosiphon pisum [216], 28 genes encoding 30 proteins in the kissing bug Rhodnius prolixus [217], and 24 genes encoding 36 proteins in the bed bug Cimex lectularius [19]. The GR family is more ancient than the OR family, which was clearly derived from within it, and is found in the crustacean Daphnia pulex [218], the mite Metaseiulus occidentalis [219], the tick Ixodes scapularis (HMR, unpublished), and many other animals (HMR, unpublished). This evolutionary history is reminiscent of the more recently described ionotropic receptors (IRs) [220][221][222], some of which also probably function in gustation.

Methods
TBLASTN searches of the genome assembly were performed using Cimex, Rhodnius, Acyrthosiphon, Pediculus, and Drosophila proteins as queries, and gene models were manually assembled in a text editor (TEXTWRANGLER). Iterative searches were conducted with each new Oncopeltus protein as query until no new genes were identified in each major subfamily. All of the Oncopeltus genes and encoded proteins are detailed in Illustrator.

The OR family
The OR family consists of the highly conserved and generally single-copy OrCo gene and 102 specific OR genes. The OrCo gene, like some of the IR genes below, is split across two scaffolds with exons on the ends of the scaffold separated by gaps and interdigitated with each other (Table S 5 .19), as are four of the specific OR genes. The specific OR genes are generally a lot smaller, and most were intact, however many have terminal or internal exons missing. Some of these could be repaired using the available whole body RNAseq, specifically raw reads that did not map to the assembly but are available in the Short Read Archive at NCBI. One OR gene was modeled as alternatively spliced with two protein products (Or88), so the total of potentially encoded ORs is 121, and only two of these is pseudogenic, leaving 119 potentially active ORs. There are, however, several additional fragments of OR genes in the assembly, some of which may represent intact genes, while some incomplete models might actually be pseudogenes.
The phylogenetic tree is rooted with the conserved OrCo proteins (Figure S  [216], the Oncopeltus ORs appear to be older expansions with relatively longer branches to most proteins.

The GR family
The GR gene set consists of 115 models, encoding 169 proteins (Table S 5 considerably larger than that of many other insects. Of these only 9 are clearly pseudogenic, but many models are currently missing termini or internal regions in gaps in the assembly, so their status remains uncertain. There are many genes modeled as alternatively spliced, in a fashion common to the GR family in several other insects, that is, with two or more long first exons spliced into shared C-terminal exons, although in most cases in the absence of transcriptome evidence these models remain hypothetical. The MAKER modeling had access to all available insect GRs in GenBank, for comparative information, and succeeded in building at least partial gene models for 20 of these 115 loci, while AUGUSTUS had partial models for many more, nevertheless all models required at least one change.
Oncopeltus has seven genes encoding proteins related to the highly conserved carbon dioxide receptors of flies and other insects [225], and these were named Gr1-7 ( Figure S 5.19). They cluster phylogenetically with four similar proteins from the Cimex. This carbon dioxide lineage is absent from all Hymenoptera sequenced to date, as well as Acyrthosiphon, Pediculus, and Rhodnius [214,216,217], so appears to have been lost repeatedly. A large related subfamily expansion was discovered in the termite Zootermopsis nevadensis [226], indicating that this gene lineage is indeed ancient in insects. It remains to be shown whether they participate in perception of carbon dioxide.
Oncopeltus has three genes encoding candidate sugar receptors, named Gr8-10 ( Figure S 5.19). This subfamily is absent from the blood-sucking hemipterans examined to date (Pediculus, Rhodnius, and Cimex), but is present as six genes in the plant-attacking pea aphid [19,214,216,217]. The only other conserved Gr is Gr11, which is an ortholog of the DmGr43a fructose receptor [227], and is also present as a single ortholog in the other hemipteroids, except Pediculus.
The remaining Oncopeltus GRs     [216]. Most of the other Drosophila GRs are implicated in perception of bitter tastants [228,229], however it is hard to be confident of such a function for these Oncopeltus GRs and their Cimex/Rhodnius relatives. It is nevertheless somewhat surprising that Oncopeltus has such a large repertoire of GRs.
It implies that they are adapted to sense a wide range of bitter taste chemicals, presumably employed in selecting suitable host plants.

The IR family
The IR family consists of at least 37 genes (Table S 5 contains two highly conserved receptors (Ir8a and 25a) that are closely related to the ionotropic glutamate receptors from which they evolved (see [221,226]), and which serve as co-receptors with the other IRs, along with Ir76b [222]. Another group of receptors (Ir21a, 40a, 68a, and 93a) are present in fairly conserved single orthologs, as is the case for most other insects, and several of these have recently been shown to be involved in perception of temperature and humidity in Drosophila [230][231][232]. The Ir41 and 75 lineages consist of multiple genes in most insects, and in Oncopeltus there are three and ten genes, respectively. In Drosophila they are involved in perception of acids and amines [233,234]. Following an approach begun with the termite Zootermopsis nevadensis [226], and applied to Rhodnius and Cimex, the conserved

Additional file 3: Chemoreceptor sequences in FASTA format. (TXT file)
Note that due to the fragmented nature of the genome assembly, the chemoreceptor gene models were predicted independently from the assembly by manually combining the gene predictions with RNA-seq data. Thus, some gene locus coordinates provided

5.4.b Sex determination and dosage compensation
Contributors: Subba (Reddy) Palli, Jayendra N. Shukla O. fasciatus is a male heterogametic (XX-female and XY-male) insect [239], but the sex determination signal remains unclear. The most downstream gene of Drosophila sex determination cascade, doublesex (dsx), is conserved in all the insects studied so far [240] as inferred from the studies in various groups of holometabolous insects [241], and is the founding member of proteins with DM domains [242][243][244].
However, the Oncopeltus genome contains DM superfamily genes in which no Oligomerization (OD2) domain was identified. This is similar to the case of two other hemipterans, Acyrthosiphon pisum and Rhodnius prolixus, whose genomes have been sequenced recently [32]. Functional analysis of DM domain genes in hemimetabolous insects is required to ascertain their potential roles in sex determination.
Homolog of transformer (tra), the upstream regulators of dsx in holometabolous insects (absent in lepidopterans and basal dipteran lineages, i.e., mosquitoes [241]) is present in Oncopeltus genome. Interestingly, the partial tra homolog obtained showed high sequence conservation in the auto regulation domain of the hymenopteran tra sequence ( Figure S 5.21 A,B). Whether Oncopeltus tra is spliced in a sex specific manner and regulates the splicing of its own and dsx pre-mRNA needs further investigation. Among other core sex determination genes homologs, transformer-2 [245], intersex [246], fruitless [247] and P-element somatic inhibitors [248] have also been identified in the Oncopeltus genome. Figure [250]). (C) Putative sex determination pathways in Oncopeltus fasciatus based on the sex determination cascade of holometabolous insects [251]. The homologs of sex determining genes doublesex (dsx), transformer (tra), transformer-2 (tra-2), fruitless (fru) and intersex (ix) are present in the genome of Oncopeltus fasciatus.
A B C

5.4.c Epigenetic machinery
Contributor: Elizabeth J. Duncan DNA methylation and post-translational modifications of histones are key regulators of chromatin structure and of gene expression, and these epigenetic systems have been associated with environmental responsiveness and phenotypic plasticity [252].
Like these insects, Oncopeltus also appears to have an intact DNA methylation system. The Oncopeltus genome encodes two copies of the maintenance methyltransferase Dnmt1, the de novo DNA methyltransferase Dnmt3, and a copy of Tet1 (Ten-eleven translocation methylcytosine dioxygenase 1) that has been implicated in removing methylation marks by converting 5 methylcytosine to 5 hydroxymethylcytosine [260,261].
In insects and other invertebrates DNA methylation (the addition of a methyl group to a cytosine residue in a CpG context) occurs predominately on gene bodies (exons and introns) [262][263][264]. In vertebrates DNA methylation occurs in CpG islands in the promoter regions of genes and is associated with gene silencing [265]. Gene body methylation also occurs in vertebrates and it has been shown to be as abundant as methylation in CpG islands [262]. The function of gene body methylation is currently unknown, however gene body methylation has been correlated with active transcription in a wide range of species [264], has been implicated in alternative splicing [254,255] and regulating chromatin organization [266].
Over evolutionary time methylation of cytosine residues leaves them susceptible to deamination to uracil, which is repaired as a thymine, leaving methylated genes with a relatively low CpG content [267]. The CpG content can be In the honeybee it has been found that there is a significant correlation between genes that have low CpG[o/e] (i.e., are predicted to be historically methylated) and genes that are currently methylated in the brains of honeybees [255], confirming that CpG[o/e] is a good predictor of genes that are currently methylated.

Methods
Gene body and intragenic sequences were extracted from the predicted Oncopeltus gene set (OGS v1.1) using CLC Genomics Workbench (version 7). For analysis of whole genome and intragenic regions, the sequences were split into 1000-nt nonoverlapping fragments, and nucleotide and dinucleotide content of gene body sequences and whole genome sequences were calculated as in previous analyses [268]. The number of components in these distributions was estimated in R (www.rproject.org) using mclust [269] model-based clustering. The best fitting model was identified among several non-nested models using Bayesian information criteria (BIC).  A.

B .
Drosophila melanogaste r C .

Histone encoding loci and histone modifying enzymes
The core unit of chromatin is the nucleosome, a highly conserved repeating unit composed of two copies of each of the four core histone proteins (H2A, H2B In Drosophila the histone genes are present in the genome in large numbers of quintet clusters, each cluster possessing one gene from each of the five classes of histone proteins. This arrangement of genes is also observed in other insects such as the pea aphid [271] and bed bug [19]. Oncopeltus does not have these quintet clusters and all of the histone genes are present as single copies on a scaffold.
The Oncopeltus genome encodes genes responsible for all classes of histone modifications; histone acetyltransferases, deacetylases, methylases and demethylases.
Unusually there are duplications of the histone acetyltransferases males absent on the first (mof), chameau (chm) and enoki mushroom (enok). Duplications of mof and enok have only previously been reported for the pea aphid [271] and the bed bug (Cimex lectularius) [19]. Phylogenetic analysis indicates that the duplications of these genes in Oncopeltus are independent of the duplications in the aphid genome and in the bed bug genome.

5.4.d Repressive C2H2 zinc finger effectors (KAP-1/ TRIM proteins)
Contributor  Note that all of these analyses are based on the OGS v1.1, and only one isoform was considered per gene (n= 19,519 proteins). The methods described here are in support of main text Fig. 3a-b.

Orthology analysis
The OrthoDB resource [273] was used in order to find shared orthologs among O.
fasciatus and the another ten arthropods (Daphnia pulex, Zootermopsis nevadensis, Cimex lectularius, Rhodnius prolixus, Acyrthosiphon pisum, Pediculus humanus, Apis mellifera, Tribolium castaneum, Danaus plexippus, and Drosophila melanogaster), using an expanded version of OrthoDB v.8 that also includes i5K species and mapped information for Rhodnius prolixus. Custom Perl scripts [274] were used in order to find the number of genes in each category shown in the bar chart in main text Fig. 3a.
For the categories "Present in majority of species" and "Patchy distribution", we required that the given ortholog was found in 9-10 or 2-8 species, respectively.
Proteins in the "Orthologs in other Arthropoda" category were found using the OrthoDB pipeline, while those in the "Homologs in other Arthropoda" did not meet those criteria but had hits with e-values <1e-05.
Furthermore, with the latest version of OrthoDB (v9.1, [275]), we specifically analyzed the 8,861 orthogroups resulting from orthology clustering analyses for the Hemiptera. Orthogroups are based on nine species that span the breadth of this order:

Phylogenetic analysis
For the phylogenetic analysis, an additional hemipteran species was included, the brown planthopper Nilaparvata lugens. However, since there was no available official gene set for this species, we extracted the sequences of conserved single-copy genes from the genome assembly using BUSCO [28] and mapped them onto OrthoDB.
Subsequently, we used the genes that were present as single-copy in all twelve species in order to build a concatenated phylogenetic tree using RAxML (Randomized Axelerated Maximum Likelihood, [276]). Briefly, a multiple sequence alignment was performed using MUSCLE [277] for each orthologous group, separately. Then, the resulting alignments were trimmed using trimAl [206] with parameters "-w 3 -gt 0.95 -st 0.01". The trimmed alignments were concatenated using the "seqret" program from the EMBOSS suite [278]. This concatenated alignment was used to build the phylogeny using RAxML 7.6.6 with the PROTGAMMA model of amino acid substitutions and 100 bootstrap replicates.

BUSCO-based quality assessment
The quality of genome assemblies can be measured by searching for the presence of conserved genes. Moreover, if these conserved genes are also single-copy, the assembly can also be tested for unexpected duplications, which can be a sign of erroneous haplotype assembly. To this end, we used the Benchmarking Universal Single-Copy Orthologs (BUSCO, v3: [28,29], scripts available under Gitlab project at https://gitlab.com/ezlab/busco), to measure the completeness of the milkweed bug genome as well as its set of predicted protein coding genes. We used both the Insecta and Arthropoda gene sets, which comprise genes that are present in at least 90% of the respective taxon. The values are highly similar between the two taxonomic datasets (    Fig.4a) of transcription factor abundance per family per species, log (base 2) scale, for 74 transcription factor families in 16 arthropod species (in Excel Supplement).  Oncopeltus proteins could be automatically classified to a transcription factor family, but without a specific orthology assignment. Each protein's DNA-binding domains (DBDs) are listed sequentially, from 5' to 3' within the amino acid sequence (in Excel Supplement).

Gene structure evolution
Contributor: Kristen A. Panfilio "Gold standard" manual curation gene set To analyze gene structure, we created a high quality ("gold standard") dataset of 30 genes whose manual curation could reasonably ensure complete and accurate protein coding gene models across seven species from four insect orders (see the main text for results and discussion on this section: main text Fig. 6a). We deliberately chose multi-exonic genes encoding fairly large proteins (median of 1720 aa), with clear single-copy orthology and strong RNA-seq expression to support model curation. For broad molecular sampling, genes were chosen that encode proteins of diverse functional classes, ranging from structural molecules to signaling pathway components, enzymes, and intracellular regulators of organelle structure and vesicular trafficking (detailed in Table S 6.6).

Splice site analysis
We annotated intron positions within multiple sequence alignments of orthologous proteins and plotted gains and losses onto a phylogeny for the genes hemocytin (also known as Hemolectin, Hml), Tenascin major (Ten-m), and UDP-galactose 4ʹepimerase (GalE) (see the main text for results and discussion on this section: main text Fig. 7; details on aligned splice positions' conservation is given in supplementary Tables S 6.7-6.9). The many-exon genes were taken from our initial "gold standard" dataset. The gene hemocytin (also known as Hemolectin, Hml) encodes a hemocyte clotting agent with numerous functional domains for cuticle-and protein-protein binding, and it had the most exons of any gene in the dataset (≤74 exons).
Unfortunately, it was not possible to annotate a complete model of this gene in the thrips, which as a thysanopteran represents a closely related outgroup to the Hemiptera, and therefore we additionally evaluated Tenascin major (Ten-m), which encodes a teneurin family multi-domain protein involved in neural development and synaptic transmission. At the other end of the spectrum, we chose an ancient gene with conservation extending to bacteria, such that the ancestral protein had no introns: UDP-galactose 4ʹ-epimerase (GalE) encodes an enzyme for sugar metabolism (EC 5.1.3.2) and is found across virtually all kingdoms of life. In fact, sequence conservation is so high at the nucleotide level that this gene initially came up as a potential candidate in our pipeline analyses of lateral gene transfer events in Oncopeltus (see also Supplemental Note 2.2), although we find here that the ancestral insect epimerase gene already had at least three exons.

Methods
Gene models: Single-copy orthologous gene models were obtained from public were marked (rounded to the nearest triplet/ whole amino acid position). A protein sequence alignment into which splice positions were encoded (denoted by the character "X") was then generated with ClustalW (accessed at http://www.genome.jp/tools/clustalw/) and manually refined.
Inference of evolutionary patterns: All splice positions were considered individually, and only those for which the sequence alignment was particularly poor were excluded (six sites within the less conserved N-terminal region and two other sites within hemocytin). Evolutionary patterns of splice position gain and loss were encoded as a binary presence/absence value per position for each species (Tables S 6.7-6.9), and the most parsimonious inference (fewest lineage-specific changes) to generate this pattern was assumed, with no weighting of probability for gains relative to losses.

Contributors: Patrice Baa-Puyoule, Gérard Febvay, Nicolas Parisot, Stefano Colella
See the main text for results and discussion on this section. The supplementary tables for this section are listed below.