The genome and developmental transcriptome of the strongylid nematode Haemonchus contortus

Background The barber's pole worm, Haemonchus contortus, is one of the most economically important parasites of small ruminants worldwide. Although this parasite can be controlled using anthelmintic drugs, resistance against most drugs in common use has become a widespread problem. We provide a draft of the genome and the transcriptomes of all key developmental stages of H. contortus to support biological and biotechnological research areas of this and related parasites. Results The draft genome of H. contortus is 320 Mb in size and encodes 23,610 protein-coding genes. On a fundamental level, we elucidate transcriptional alterations taking place throughout the life cycle, characterize the parasite's gene silencing machinery, and explore molecules involved in development, reproduction, host-parasite interactions, immunity, and disease. The secretome of H. contortus is particularly rich in peptidases linked to blood-feeding activity and interactions with host tissues, and a diverse array of molecules is involved in complex immune responses. On an applied level, we predict drug targets and identify vaccine molecules. Conclusions The draft genome and developmental transcriptome of H. contortus provide a major resource to the scientific community for a wide range of genomic, genetic, proteomic, metabolomic, evolutionary, biological, ecological, and epidemiological investigations, and a solid foundation for biotechnological outcomes, including new anthelmintics, vaccines and diagnostic tests. This first draft genome of any strongylid nematode paves the way for a rapid acceleration in our understanding of a wide range of socioeconomically important parasites of one of the largest nematode orders.


Background
The strongylid nematode Haemonchus contortus (barber's pole worm) is one of the most important parasites of livestock, and represents a large order of nematodes (Strongylida) that infect both animals and humans worldwide [1][2][3]. H. contortus infects hundreds of millions of sheep and goats globally, and causes deaths and production losses estimated at tens of billions of dollars per annum. This nematode feeds on blood from capillaries in the stomach mucosa, and causes hemorrhagic gastritis, anemia, edema and associated complications, often leading to death of severely affected animals [2]. H. contortus is transmitted orally from contaminated pasture to the host through a complex 3-week life cycle [4]: the eggs are excreted in the host feces, the first-stage larva (L1) develops inside the egg, then hatches (within about 1 day) and molts through to the second-stage (L2) and third-stage (L3) larval stages within approximately 1 week; the infective L3s are then ingested by the host, exsheath, and, after a histotropic phase, develop through the fourth-stage larvae (L4s) to dioecious adults; these two last stages both feed on blood.
Only four main drug classes have been available for the treatment of strongylid infections, and resistance against these classes is spreading worldwide [5]. It is thus highly desirable to search for new drug targets encoded in the H. contortus genome. Although vaccines using some native parasite antigens (called H11 or H-gal-GP) can partially prevent haemonchosis in experimental sheep, homologous recombinant molecules have failed to achieve protection [6]. Therefore, current treatment relies predominantly on the use of nematocidal drugs (anthelmintics). Because resistance against the main classes of drugs has become widespread [5], the ongoing design of new compounds, such as monepantel [7], cyclooctadepsipeptides [8], and derquantel-abamectin [9], is required. Discovering new drugs has been challenging, particularly because of the current limited understanding of the biology of H. contortus and how it interacts with its host [2]. Here, we describe a draft genome and developmentally staged transcriptome of H. contortus to substantially improve our understanding of this parasite at the molecular level. This genome provides a major resource to the scientific community for a wide range of genomic, genetic, evolutionary, biological, ecological, and epidemiological investigations, and a solid foundation for the development of new interventions (drugs, vaccines and diagnostic tests) against H. contortus and related strongylid nematodes.

Sequencing and assembly
We sequenced the genome of H. contortus (McMaster strain, Australia) at 185-fold coverage (Table 1; see  Additional file 1 Table S1), producing a final draft assembly of 320 Mb (scaffold N50: 56.3 kb; Table 1) with a mean GC content of 42.4%. We detected 91.5% of 248 core essential genes by CEGMA, suggesting that the assembly represents a substantial proportion of the entire genome. The estimated repeat content for this draft genome is 13.4%, equating to 42.8 Mb DNA. To overcome challenges in the assembly of the genome, we removed excessive repetitive and erroneous reads by khmer filtering [10] and normalization [11] to produce a representative assembly, an approach that should be useful for other complex genomes. This assembly contained 2.0% retrotransposons and 2.1% DNA transposons, which is similar to that reported for some other nematode genomes sequenced to date, including those of Caenorhabditis elegans, Pristionchus pacificus, and Ascaris suum [12][13][14]. We identified 40,046 retrotransposon sequences (see Additional file 1, Table S2) representing at least 9 families (4 long terminal repeats (LTRs), 3 long interspersed nuclear elements (LINEs), and 2 short interspersed nuclear elements (SINEs)). We also identified two families of DNA transposons (36,861 distinct sequences in total) and 235,635 unclassified repeat elements. The most abundantly transcribed repeat elements were DNA/TcMar and LINEs/retrotransposable elements (RTEs; see Additional file 1, Table S2). This richness of transposable element families is substantially higher than that predicted for other genomes of parasitic nematodes [13][14][15][16]. Overall, the present draft genome (320 Mb) is the largest of any animal parasitic nematode sequenced to date (for example, 273, 89, and 64 Mb for A. suum, Brugia malayi, and Trichinella spiralis, respectively) [14][15][16].
Excretory/secretory (ES) proteins are central to the parasite-host relationship [19]. We predicted the secretome of H. contortus to comprise 1,457 proteins with a diverse range of functions (see Additional file 1, Table S10). Most notable were 318 peptidases, including 98 metallopeptidases and 68 cysteine, 67 aspartic, 19 serine peptidases (predominantly clans MA, CA, AA, and SA, respectively) and 66 peptidase inhibitors (including fibronectin type III), 90 lectins (including C-type and concanavalin A-like), 65 sperm-coating protein/Tpx-1/Ag5/PR-1/Sc7 (SCP/TAPS) proteins, 38 transthyretin-like (TTL) proteins, and 27 kinases. Many secreted peptidases (comprising the 'degradome') and their respective inhibitors have known roles in the penetration of tissue barriers and feeding for a range of parasitic worms, including H. contortus [2,6,18]. Some of these ES proteins are involved in host interactions and/ or inducing or modulating host immune responses against parasitic worms, which are often Th2-biased [19].

Key transcriptional changes during developmental transitions in the life cycle
H. contortus development involves a number of tightly timed processes [4]. Embryogenesis generates the basic tissue types of the nematode, and each tissue type differentiates at a specific point in the developmental cycle. Post-embryonic structures required for parasitism and reproduction then differentiate in the larval stages L1 to L4. This includes the specialized development of the buccal capsule for blood feeding (from L4 onward), sexual differentiation at the L4 stage, and gametogenesis in the adult stage. Substantial growth occurs at the L2, L4, and adult stages. Development occurs in two different environments, on pasture for the free-living stages L1 to L3, and in the host for the dioecious L4 and adult stages ( Figure 2). Each of these stages has different requirements, in terms of motility, sensory perception, metabolism, and the regulation of hormones of the endocrine system. L3, which is the infective stage, and thus represents the transitional stage from a free-living to parasitic organism, persists in the environment until it is ingested by the host, where it then receives a signal (mainly CO 2 ) to commence its development as a parasite. The complexity of the H. contortus life cycle coincides with key developmental alterations in the nematode that probably require tightly controlled and rapidly regulated transcriptional changes.
We studied differential transcription from stage to stage, as the parasite developed from egg to adult ( Figure 2). The transition from the undeveloped egg to L1 was associated with significant upregulation of transcription for 1,621 genes encoding a substantial number of channels (n = 641), including LGICs and ES proteins (n = 397), GPCRs (134), transcription factors (TFs; n = 43), kinases (n = 100), and phosphatases (n = 35; see Additional file 1, Table S11). Although this expansion is probably associated with mitosis, organelle biogenesis, apoptosis, and overall gene expression during the rapid growth and development of L1 [4], based on knowledge of C. elegans [20], the expansion of some key subsets of channels, pore and electrochemical potential-driven transporters, GPCRs (classes A and SR), and various kinases/ phosphatases probably relate to chemosensation, mechanosensation, osmosensation, and/or proprioception of the free-living L1, as it rapidly adapts to its new and harsh external environment. The activity of L1s of H. contortus and their search for microbial food sources might reflect the expansion of ES proteins and associated peptidases and their inhibitors ( Figure 2). The switch to L2 sees an approximately 50% reduction in number of upregulated genes of the same groups, possibly reflecting the gradual adaptation to its free-living environment and a reduced level of stress in finding food. The transition to the L3 stage sees an 88.5% reduction in the number of differentially transcribed genes representing the same groups ( Figure 2), an expected finding, given that this stage undergoes ensheathment, is no longer able to feed, and must live on accumulated reserves at a reduced metabolic rate in order to survive (as an 'arrested' but motile infective L3 stage) for extended periods in the external environment [2]. Once ingested by the host animal, the transition from the L3 to the parasitic L4 and adult stages sees a renewed, massive surge in the number of differentially transcribed genes of the same spectrum of molecules and of structural proteins, but, as expected, very limited differences between the L4 and adult stages, with the exception of some genes Figure 2 Transcriptional changes in the life cycle of Haemonchus contortus. In a 3-week life cycle of the parasite, eggs (E) are excreted in host feces; the first-stage larva (L1) develops inside the egg to hatch and molt through to the second-stage (L2) and third-stage (L3) larval stages within a week. The infective L3s are then ingested by the small ruminant host, where they exsheath and, after a short tissue phase, develop through the fourth-stage larval (L4) stage to dioecious adults; both of these stages feed on host blood from capillaries in the internal wall of the stomach. Disease in the host relates to this blood-feeding activity. In this figure, changes in transcription in the transition from stage to stage are summarized and interpreted in the context of the biology of the parasite. Information is given on key genes differentially transcribed between adult female (Af) and male (Am) H. contortus, and involved in reproductive and other biological processes; gene codes follow those of Caenorhabditis elegans orthologs.
(for example, those encoding vitellogenin) that appear to relate specifically to reproduction ( Figure 2; see Additional file 1, Table S11).
During the key transitions in the life cycle (from egg to L2, and then from L3 to L4) linked to substantial growth and development [2], a range of genes encoding collagens and cuticular proteins are upregulated per transition ( Figure 2; see Additional file 1, Table S11). In the nematode cuticle, such molecules are crucial for the maintenance of nematode body shape, and also for protection against and contact with the external environment or host interface. We found prominent variation in transcription profiles among 28 individual collagen genes in the transition from the free-living to parasitic stages, consistent with previous immunoproteomic findings [21].
More than 120 peptidase genes were significantly upregulated in blood-feeding stages ( Figure 2; see Additional file 1, Table S11). Conspicuous among them were genes encoding secreted peptidases of various clans, including MA (metallopeptidases; M12A, M01, M13, M12A, M10A), AA (aspartic peptidases; all A01A) and CA (cysteine peptidases; mostly CA01A; see Additional file 1, Table S11), which have known roles in the degradation of tissues during the parasite's histotropic phase and digestion of blood components following establishment and buccal-capsule attachment to the abomasal wall, and might be crucial for growth, development, and survival of H. contortus in the host animal [2]. These findings support previous evidence showing that, for example, cysteine peptidases play a crucial role in the catabolism of globin by the cleavage of hemoglobin in blood-feeding nematodes [22][23][24][25]. Concomitantly, in the blood-feeding stages, we observed upregulated transcription of genes encoding succinate dehydrogenase subunit B and glutamate dehydrogenase genes via the respiratory electron transport chain [26] (proposed to maintain the redox balance in response to the accumulation of the end products from anaerobic metabolism [27]), and hemoglobin-like proteins [28] (probably involved in oxygen uptake, osmotic regulation, iron storage and/or oxygen-detoxification [29]). We also found increased transcription of genes encoding enzymes, including glutathione S-transferase, cytoplasmic Cu/Zn superoxide dismutase, catalase, glutathione peroxidase and/or peroxiredoxin, which are likely to have roles in heme transport or detoxification of reactive oxygen species from endogenous metabolic activities from the host during H. contortus infection; this is supported by findings from previous investigations [30][31][32] and recognized as characteristic of tissue-dwelling or blood-dwelling parasites [32].

Parasite-host interactions
Considering the substantial attack against H. contortus within the host, many ES proteins are expected to play crucial roles during parasite establishment, infection, immune modulation, or evasion. This expectation is supported by abundant transcription in the L4 and adult stages of genes encoding peptidases (n = 142), SCP-like extracellular proteins (including 20 neutrophil inhibitory factors; NIFs), lectins (n = 23), TTL proteins (n = 10), peptidase inhibitors (n = 6; including 4 Kunitz-like molecules) and fatty acid retinoid binding proteins (n = 4; see Additional file 1, Tables S10 and S13). In total, 333 of 1,457 genes (22.9%) encoding ES proteins were transcribed at significantly higher levels in the parasitic compared with the free-living stages (see Additional file 1, Table S10). The genome-wide average for this upregulation was significantly lower (14.7%; P = 5.8 × 10 -12 ).
In the hematophagous stages, we identified 54 upregulated genes encoding SCP/TAPS proteins [35], characterized by one or more SCP-like extracellular domains (IPR014044 and/or IPR001283). These proteins, originally found in hookworms, are also called activation-associated proteins or Ancylostoma-secreted proteins (ASPs) [36,37]. Although the numbers of genes inferred to express SCP/ TAPS proteins were similar between the L4 and adult stages (see Additional file 1, Tables S11 and S14), there were qualitative and quantitative differences in transcription compared with other developmental stages. Although two genes encoding the SCP/TAPS proteins Hc24 and Hc40 were identified previously in ES products of adult H. contortus [38,39], we identified 82 more such genes (see Additional file 1, Table S14). This finding supports a previous proposal for a wide array of molecules of this group in H. contortus [40], and suggests a diversified, active, and specific involvement of SCP/TAPS proteins in infection. Of the 84 SCP/TAPS proteins (62 single and 22 double SCP-like domain molecules) encoded by H. contortus, 74 were found to have homologs in C. elegans (see Additional file 1, Table S14); the 10 H. contortus-unique SCP/TAPS proteins, some of which are upregulated in parasitic stages, probably relate to host interactions and/or disease. Although SCP/TAPS proteins are still enigmatic, in terms of their functions, they deserve detailed investigation, particularly given that they are being explored as vaccine candidates for other hematophagous nematodes. For instance, in the human hookworm Necator americanus, Na-ASP-2 was tested in a phase I clinical trial, owing to its known protective properties in humans [41], although initial vaccination with the recombinant protein in adjuvant resulted in unexpected allergic responses following natural exposure to the parasite [42]. The crystal structure of Na-ASP-2 shows charge segregation similar to that of mammalian chemokines, indicating that this molecule might be an agonist or ligand for some GPCRs, such as chemokine receptors [43]. Of the 84 SCP/TAPS proteins identified in H. contortus, 20 were NIFs and predicted to be abundant in ES products, and have already been found in some other nematodes [19]. Although NIFs have not been reported previously for H. contortus, an Ancylostoma caninum homolog (SCP-1) is known to bind host integrin CR3 (CD11b/CD18) and to be able to inhibit neutrophil function, including oxidative burst [44,45].
As expected from previous molecular studies [40,46], eight genes encoding NIM-like proteins were found to be abundant in the hematophagous stages of H. contortus (see Additional file 1, Table S13). Although the functional roles of NIM proteins are unclear, they are likely to be involved in host-parasite interactions, because they are abundantly transcribed in parasitic stages. Most have N-terminal signal peptides and, thus appear to be actively excreted/secreted [40], although there is variation in the abundance of these proteins among different populations of H. contortus [40,46,47].
Of 53 genes encoding TTL proteins, 10 were significantly upregulated in parasitic stages of H. contortus (see Additional file 1, Table S13). Most TTL proteins identified to date are relatively conserved across large evolutionary distances [48], and are enzymes of purine catabolism that catalyze the conversion of 5-hydroxyisourate to 2-oxo-4hydroxy-4-carboxy-5-ureidoimidazoline [49,50]. In metazoans, TTLs can also bind hormones, such as thyroxine (T4) and vitamin A [51], and can enable cell corpse engulfment by binding surface-exposed phosphatidylserine on apoptotic cells [52]. Among the proteins encoded by nematode-specific genes, TTLs represent one of the largest groups [53,54]. A subset of TTL proteins has also been identified in Ostertagia ostertagi, a nematode related to H. contortus [55,56], in the human filarial nematode B. malayi [57], and in the plant-parasitic nematodes Xiphinema index, Heterodera glycines, Meloidogyne incognita, and Radopholus similis [53,[58][59][60]. For example, in O. ostertagi, at least 18 ttl genes have been identified by data mining, most of them being constitutively transcribed from the free-living L3 through to adult females and males [56]. In H. contortus, a TTL has been isolated from ES products from adult worms and shown to be immunogenic [40], and TTL homologs are also abundant in An. caninum ES [61]. These data suggest the testable hypothesis that TTLs, together with SCP/TAPS proteins, play key roles in host interactions.

Immune responses
Based on the current knowledge and understanding of immune responses against helminths in animals [19], we compiled a comprehensive list of H. contortus ES homologs with known immunomodulatory or immunogenic roles in other nematodes (see Additional file 1: Table S13). Such homologs upregulated in the L4 and adult stages represent 5.6% of the predicted H. contortus secretome, which is significantly lower than the genome-wide average of 14.7% (P<10 -6 ). In addition to the molecules HcES15 and HcES24, whose precise functions are still unclear, proteins within this secretome that are predicted to direct or suppress immune responses include close homologs of N-acteylglycosaminyltransferase and leucyl aminopeptidase ES-62 of the filarioid nematode Acanthocheilonema vitae [19]. ES-62 is known to inhibit B-cell, T-cell and mast cell proliferation/responses, induce a Th2 response through the inhibition of IL-12p70 production by dendritic cells, and promote alternative activation of the host macrophages via the inhibition of Toll-like receptor (TLR) signaling [19]. Other molecules of H. contortus predicted to be immunomodulatory include homologs of another B-cell inhibitor (CYS-1), 8 serpins and 20 NIFs [19]. Some H. contortus ES proteins are predicted to be involved in immune evasion; for instance, some could mask parasite antigens by mimicking host molecules (for example, C-type lectins, concanavalin A and galectins) [19]. In spite of some similarities among nematode-host systems, based on the nature and extent of molecules identified, the host immune responses against the parasitic stages of H. contortus appear to be distinct from those associated with other nematodes, such as Ascaris and filarioids, which is supported by other experimental findings [19]. Taken together, the present findings indicate that H. contortus has a substantial arsenal of ES proteins that are likely to be involved in modulating, evading, and/or blocking immune responses in the host.

Vaccine molecules
There has been a major emphasis on the development of vaccines to fight against haemonchosis [6]. Most effort has been directed at inducing immunity in sheep against proteins expressed in or excreted/secreted from the gut of H. contortus, with the aim of disrupting or inhibiting the parasite's digestion of host blood. To date, the two most effective immunogens assessed have been the aminopeptidase family H11 [62,63] and the Haemonchus galactose-containing glycoprotein complex (H-gal-GP) [64]. Both of these molecular complexes contain integral membrane proteins with hemoglobinase activity, are expressed mainly in the microvillar surface of the parasite's gut, and induce 70 to 90% protection against infection in a number of sheep breeds [6]. In the current study, using genomic and transcriptomic data, we were able to define the different molecular variants within these two complexes.
We found that H11 represents a group of 25 different metallopeptidases (clan MA; family M01; see Additional file 1, Table S15), which are upregulated six-fold to 210-fold in the parasitic over the free-living stages of H. contortus. Key components of H-gal-GP, representing predominantly metallopeptidases (for example, MEPs 1 to 4) [65,66], aspartyl peptidases (for example, HcPEPs 1 and 2) [67,68], and cysteine peptidases (for example, AC-1 to AC-5; HMCP-1 to HMCP-6) [22,67,[69][70][71], were also identified using sequence data from previous proteomic studies (see Additional file 1: Table S15). Again, as expected from previous studies [6], all three classes of peptidases were significantly upregulated in the L4 and adult stages (see Additional file 1, Table S15). We found substantial diversity in the cysteine peptidases (n = 81), which have been also under close scrutiny as vaccine candidates. Many of these enzymes (n = 14) represent clan C01A (cathepsin B-like peptidases), and 34.6% were represented in the ES degradome (see Additional file 1, Tables S4 and S16). We also identified 11 legumains (clan CD; family C13), which might activate key family C01A peptidases through cleavage of the peptide backbone between the pro-segment and mature enzyme domains [72].
In addition, the serine peptidase complex contortin has received attention as an efficient anticoagulant (with dipeptidyl IV activity) in parasitic stages of H. contortus [73]. Contortin is inferred to belong to clan SC serine peptidases (family S28). We found 13 family S28 representatives among the 107 serine peptidases predicted for H. contortus (see Additional file 1, Table S15), all of which were upregulated in the parasitic stages. Nine of these thirteen lysosomal Pro-Xaa carboxypeptidases were represented in the ES degradome (see Additional file 1, Table S16), supporting the contention that contortin is also immobilized [73]. Interestingly, H. contortus shares many of these key classes of peptidases with other hematophagous parasites, including hookworms, indicating relative conservation in sequence and function linked mainly to feeding (blood meal digestion or anticoagulation). Studies to date have shown that selected recombinant proteins representing H11 and H-gal-GP do not induce protective immune responses, and carbohydrate moieties alone are also not protective [6]. Therefore, the combined use of proteomic and glycomic tools, underpinned by the present genomic and transcriptomic data sets as well as by animal experimentation, should be advantageous for designing future vaccines.

Prediction and prioritization of drug targets
The excessive and uncontrolled use of a small number of drug classes for the treatment of haemonchosis has led to major problems of drug resistance in H. contortus to most of these compounds [5]. Unfortunately, only a very small number of new anthelmintics (cyclooctodepsipeptides and aminoacetylnitriles) have been discovered in the past two decades using traditional chemical screening approaches [7,74]. Genome-guided drug target or drug discovery provides an alternative means to conventional screening and repurposing [75]. The aim of genome-guided discovery is to identify genes or molecules whose inactivation by one or more drugs will selectively kill parasites but not harm the host animal. Because H. contortus and related strongylid nematodes are challenging to maintain outside of their hosts, and gene-specific perturbation by double-stranded RNA interference (RNAi) is inconsistent [76], directly assessing gene essentiality on a large scale is not yet practical. However, essentiality can be predicted from functional information (for example, lethality) for C. elegans, and this approach has already yielded credible targets for nematocides [77]. For H. contortus, we inferred 641 molecules with essential homologs in C. elegans linked to lethal phenotypes upon gene silencing (see Additional file 1, Table S17). We also screened for enzymatic chokepoints in biological pathways of H. contortus. Such chokepoints represent reactions that consume or uniquely produce a molecular compound; the disruption of such enzymes should lead to a toxic accumulation (for unique substrates) or starvation (for unique products) of metabolites within cells [78,79]. We gave the highest priority to targets inferred to be encoded by single genes, reasoning that lower allelic variability in H. contortus populations would be less likely to give rise to drug resistance. Using this stringent approach, we predicted 260 druggable proteins in H. contortus (see Additional file 1, Table S17), of which 106 had ligands fulfilling the Lipinsky rule of five [80] (Table 3). Conspicuous among these were 17 channels or transporters, which represent protein classes known to be targets for anthelmintics, including macrocyclic lactones, levamisoles, and AADs [7,81,82], and other candidates including 27 kinases, 7 TFs, and 4 phosphatases known to be specific targets for norcantharidin analogues [77]. This list of prioritized target candidates could be tested for anti-nematodal effects in larval development assays or directly in experimental sheep, and should enable rational anthelmintic design.

Prospects for functional genomics
Genomic-guided drug discovery would be assisted by assessing essentiality of drug targets directly in H. contortus itself. Likewise, functional analysis of the approximately 30% of H. contortus genes that are parasite-specific, some of which are likely to play key roles in host-parasite interactions, would also be enabled by such gene inactivation. However, to date, gene-specific silencing in the parasite itself has been plagued by inconsistent results [76]. Recent findings suggest that this challenge can be overcome if the conditions for effective RNAi were optimized [83,84]. Inconsistent RNAi in H. contortus is apparently due to inefficient double-stranded RNA delivery, incomplete knowledge of the RNAi machinery, and variability in gene transcription in different stages or tissues of the parasite. Using the gene set of H. contortus, we identified 229 genes encoding proteins involved in the RNAi pathway (Figure 3; see Additional file 1, Table S18), including rde-4 and rsd-2, both previously thought to be absent [85], although we did not find rde-2 or sid-2. We also found that most RNAi genes in H. contortus are upregulated at the L2, L4, and adult stages (see Additional file 1, Table S18). These findings suggest that future assessments of gene function in H. contortus should focus on using these stages, which are most likely to be amenable to RNAi.

Conclusions
The genomic and transcriptomic exploration of H. contortus provides new insights into the molecular biology of one of the most important parasites of small ruminants worldwide. This investigation has elucidated transcriptional alterations taking place throughout the life cycle, particularly during the transition from the free-living to the parasitic stages, and has emphasized molecules involved in host-parasite interactions and immune responses. Determining the genome sequence and transcriptomes of H. contortus can accelerate post-genomic explorations of genes and gene products involved in nematode development and reproduction, future proteomic and metabolomic studies, parasite-host interactions, and pathogenesis of disease. The characterization of the RNAi machinery for H. contortus also provides a solid platform for functional genomic work in selected stages of the parasite. Therefore, an integrated systems biology approach should provide novel strategies for parasite intervention via drugs, vaccines, and diagnostic tests. For instance, future work could focus on defining a spectrum of key molecules involved in pathways linked to the development of the nervous system in different stages of H. contortus, and assessing their potential as drug targets. Moreover, exploring unique groups of molecules, such as SCP/TAPS, TTLs, and the complex array of peptidases, and understanding the roles of these molecules in host-parasite interactions is likely to support the design of new interventions.
The complexity of the genome of H. contortus very probably relates to substantial sequence heterogeneity in non-coding regions among individual worms in the populations used for sequencing, and possibly even within individual worms. Although it is unclear why there is so much sequence variation in some genomic regions, it seems that this parasite mutates at a very high rate in non-coding regions, which might explain, then to some extent, the parasite's ability to rapidly become resistant to anthelmintics. To date, much research has focused on investigating possible associations between resistant phenotypes and mutations in particular candidate genes. Having available a draft genome of H. contortus now provides a solid foundation for genome-wide studies to identify genetic loci associated with anthelmintic resistance, and to explore their inheritance and mechanisms of drug resistance. Such studies will require an improved understanding of the population   biology and genetics of this parasite, and knowledge of how mutations arise and are inherited. We expect drug resistance in H. contortus to be multigenic, and we hypothesize that complex resistance mechanisms operate in this nematode, possibly even involving microRNAs. Clearly, the genome of H. contortus will underpin future research in these and many other areas. Although the present study focused on H. contortus, the findings and the technological approaches used will be applicable to other parasitic nematodes of major animal and human health importance. Importantly, this first draft genome for a strongylid nematode paves the way for a rapid acceleration in our understanding of a wide range of socioeconomically important parasites of one of the largest nematode orders.

Production and procurement of H. contortus
Animal ethics approval (no. 0707528) was granted by the University of Melbourne. H. contortus was produced in Merino lambs (3 months of age; Victoria, Australia) maintained under helminth-free conditions. Sheep were inoculated intraruminally (via oral intubation) with 5,000 to 10,000 infective L3s of H. contortus. Eggs were isolated from the feces of infected sheep (1 month after inoculation) using a sucrose flotation procedure [86]. L1s, L2s, and L3s were produced in culture (at 27°C), as described previously [87]. L1s, L2s, and L3s, identified according to Veglia [4], were collected after 1, 4 and 8 days, respectively, and washed extensively in tap water. L4s and adults of H. contortus were collected from the abomasa of infected lambs following euthanasia by intravenous injection of pentobarbitone sodium (Virbac, Carros Cedex, France) 13 days and 1 month, respectively, after infection with L3s. These latter two developmental stages of H. contortus were washed extensively in physiological saline, and males and females separated before freezing. All of the developmental stages of H. contortus collected were snap-frozen in liquid nitrogen and then stored at -70°C until use.

RNA sequencing and transcriptome assembly
Total RNA was isolated separately from different developmental stages (egg, L1 to L4, and adult) and sexes (male and female) of H. contortus (Haecon-5 strain, Australia) using TriPure isolation reagent (Roche Molecular Biochemicals, Mannheim, Germany). For L1 to L3, packed volumes of 20 to 50 µl were used, equating to thousands of larvae. For L4 and adult stages, packed volumes of 50 to 200 µl were used, usually equating to 100 to 200 worms per aliquot. RNA yields were estimated spectrophotometrically (NanoDrop 1000; Nano-Drop/Thermo Scientific Inc., Pittsburgh PA, USA), and the integrity of RNA was verified using a BioAnalyzer 2100 (Agilent Technologies Inc., Wilmington, DE, USA).
RNA sequencing (RNA-seq) was carried out as described previously [88]. The sequences derived from each library representing each stage and sex were assessed for quality, and adaptors removed. Following removal of any potentially contaminating sequences, RNA-seq data for all stages and both sexes were assembled into de novo predicted transcripts using the programs Velvet and Oases [89] or SOAPdenovo [90]. Non-homologous transcripts were first used to train the de novo gene prediction programs SNAP [91] and AUGUSTUS [92], and transcripts were then used to assist the evidence-based prediction of the non-redundant gene set for H. contortus.

Genomic sequencing and initial assembly
High molecular weight genomic DNA was isolated from adult male and female H. contortus (McMaster strain, Australia) using an established protocol [93]. The specificity of genomic DNA was verified by automated sequencing of the second internal transcribed spacer (ITS-2) of nuclear ribosomal DNA following PCR amplification from genomic DNA. Total DNA amounts were determined using a Qubit Fluorometer dsDNA HS Kit (Invitrogen), in accordance with the manufacturer's instructions. Genomic DNA integrity was verified by agarose gel electrophoresis and using a 2000 BioAnalyzer (Agilent). Mate-pair genomic libraries (with 300 bp and 500 bp inserts) were built [88], and checked for both size distribution and quality with a 2100 BioAnalyzer (Agilent). Jumping genomic libraries (with 2, 5, and 10 kb inserts; see Supplementary Table 1) were constructed as described previously [94]. To produce sufficient amounts of DNA for the jumping libraries, 250 to 500 ng of genomic DNA were subjected to whole genome amplification (WGA) using the REPLI-g Midi Kit (Qiagen Inc., Valencia, CA, USA), in accordance with the manufacturer's protocol. All sequencing was carried out on Illumina machines (either GA II or HiSeq; Illumina Inc., San Diego, CA, USA) with 2 × 75 or 2 × 100 reads for pairedend libraries, and 2 × 49 reads for jumping libraries. For all sequencing, reads were exported to FASTQ format [95]. Several steps were taken to enforce read quality. Custom Perl scripts were used to trim the final nucleotide of each read, nucleotides with a quality score of less than 3, or 'N' residues. Quality-trimmed reads were kept if they were 65 nt or more long from paired-end data or 48 nt or more long from jumping-library data. We used a modified version of the read-decontamination pipeline of Kumar and Blaxter [96] to rid the genomic and RNA-seq datasets of any possible contaminating sequences of mammalian, bacterial, mycotic, protistan, and plant origins. In brief, genomic and RNA-seq reads were assembled into preliminary contigs using SOAPdenovo, without scaffolding for genomic DNA and using oases with scaffolding for RNA-seq. For genomic DNA contigs and cDNA scaffolds, minimum contig sizes of 60 and 200 nt, respectively, were accepted.
We mapped reads to the preliminary contigs with the program Bowtie 2 [97,98]. Unlike Kumar and Blaxter [96], we then performed exhaustive MegablastN [99] searches on all contigs (rather than using a random subset) to determine which sequences had likely contaminant status. MegablastN searching was done against opposing custom nematode and contaminant genomic DNA databases: the nematode set represented genomic assemblies from C. elegans, P. pacificus, A. suum (from WormBase WS230), and Ancylostoma ceylanicum (Schwarz EM, unpublished data). The contaminant set included sheep and cow genomic sequences, 1,991 bacterial genomes (including Prevotella ruminicola) from the European Nucleotide Archive (ENA), and a bovine rumen metagenome [100]. Because A. ceylanicum is a strongylid nematode parasite, related to H. contortus [101], we expected that any H. contortus contigs of genuine nematode origin were highly likely to have a better MegablastN hit to A. ceylanicum or C. elegans than to any contaminants. Each preliminary H. contortus contig was thus classed as a contaminant if it had a score against the contaminant database of 50 bits or more, and which was at least 50 bits higher than any match by that contig against the nematode database.
We exported all reads that failed to map to a contaminant contig (including both the reads that mapped to non-contaminant contigs and those that did not map to a contig at all). This set of reads was then used for genome and transcriptome assembly, and for quantifying transcription levels. Although our pipeline for decontamination is similar to that of Kumar and Blaxter [96] and uses much of the same source code, it differs by not trying to classify contigs as contaminants based on GC percentage or coverage levels (which we found not to work well with our data), but by using exhaustive MegablastN searching instead.
Our genomic reads, even after initial quality filtering, could not be assembled with Velvet because they required more than 256 GB of system RAM, the maximum amount available to us on our largest server. Therefore, for Velvet assembly, we used khmer to digitally normalize read frequencies [11]. First, we constructed a hash table of 75 GB in size, scanned through the paired-end genomic reads, and discarded reads with 20-mers that we had already found 50 times in previous reads. We rescanned the reads, discarding those with unique 20-mers, reasoning that unique 20-mers in such a large dataset were likely to represent sequencing errors or trace contaminants; khmer estimated the false positive rate of the hash table to be less than 0.001. The khmer filtering automatically converted the reads from FASTQ to FASTA format. We assembled khmer-filtered reads into a H. contortus genome sequence with Velvet 1.2 [102]. For our final Velvet assembly, velveth was run with k = 21; for preliminary assemblies, velveth was run with values from k = 41 down to k = 19. The velvetg parameters were as follows: -shortMate-Paired3 yes -shortMatePaired4 yes -shortMatePaired5 yes -cov_cutoff 4 -exp_cov 100 -min_contig_lgth 200 -ins_length 300 -ins_length_sd 50 -ins_length2 500 -ins_length2_sd 200 -ins_length3 2000 -ins_length4 5000 -ins_length5 10000. Because Velvet 1.2 prioritizes paired-end over jumpinglibrary data, artifactual long-range connections were less likely to confound assembly.
Preliminary assemblies with k-mer values at or near 41 were approximately 700 Mb in size, more than twice the estimate of the genome size (315 Mb) of H. contortus based on Feulgen image analysis densitometry [103]. The khmer filtering allowed us to complete assemblies with k values as low as 21. With decontaminated reads, k = 21 resulted in an assembly size of 404 Mb, perhaps because using unusually small word values allowed Velvet's de Bruin graph to merge polymorphisms rather than treat them as distinct, allelic sequences; scaffold N50 was 13.6 kb. The k = 21 assembly showed relatively low levels of non-scaffolding residues (79.3% non-N), but we improved this percentage to 95.8% non-N residues by adding reads to the assembly with GapCloser from SOAPdenovo. GapCloser also modestly increased the assembly size to 414 Mb and scaffold N50 to 17.6 kb.
To improve the Velvet assembly, we used SOAPdenovo (version 2.0) to scaffold the 404 Mb assembly using errorcorrected reads without khmer filtering. The program GapCloser was used to close gaps in the scaffolded assembly. With k = 21, this gave us an assembly of size 453 Mb that achieved an N50 of 34.2 kb after gap-closure, with 93.8% non-N residues. For both Velvet and SOAPdenovo, we tested the gap-filled k = 21 assemblies using the program CEGMA [104]. For Velvet, we predicted 157 of 248 conserved eukaryotic genes (CEGs; 63%) completely, and 211 of 248 (85%) at least partially. Using SOAPdenovo, we predicted 182 of 248 CEGs (73%) completely, and 232 of 248 (91.5%) at least partially. Given the superior N50, completeness of assembly and the prediction of more CEGs, we selected the SOAPdenovo scaffolded assembly.

Final draft genome assembly
The initial assembly (453 Mb) was substantially larger than the genome size estimate based on Feulgen image analysis densitometry (315 Mb). Therefore, we re-evaluated the genomic sequence composition by comparing assembled DNA scaffolds containing at least one predicted proteincoding gene against assembled DNA scaffolds that had no such prediction. For this comparison, we did not use only the approximately 26,000 protein-coding genes in our final set, which all had RNA-seq evidence to support their expression, but instead used a larger set of approximately 29,184 genes, which included both RNA-seq-supported and protein-supported predictions. Moreover, we mapped Illumina sequencing reads that had been decontaminated but not yet subjected to digital normalization with khmer, so that variations in sequencing coverage would remain detectable. We found that scaffolds containing proteincoding genes had a much higher coverage of Illumina sequencing reads (with a single distinct peak) than scaffolds that were completely devoid of predicted proteincoding genes (which were represented by a very lowcoverage peak). All the high-coverage coding regions were contained within a total of 320 Mb of scaffolded sequences, whereas all of the low-coverage non-coding regions were within a total of 133 Mb of scaffolded sequences. Moreover, when we examined the two sequence sets with CEGMA for completeness of gene content, the 320 Mb set was essentially identical to the 453 Mb assembly, whereas the 133 Mb set was almost completely devoid of gene content. We therefore selected the 320 Mb scaffold set as our final draft assembly. Lowcoverage scaffolds might represent a residue from the khmer removal of sequences with high coverage, and heterozygosity/heterogeneity/haplotype differences linked to non-coding regions, possibly due to variations among individual worms of the population.
Identification and annotation of non-coding regions and protein-coding genes Genomic repeats specific to H. contortus were modeled using the program RepeatModeler [105] by merging repeat predictions by RECON [106] and RepeatScout [107]. Repeats in the H. contortus genome assembly were identified by RepeatMasker [108] using modeled repeats (via RepeatModeler) and known repeats in Repbase (version 17.02) [109].
The H. contortus protein-coding gene set was inferred using an integrative approach, utilizing the transcriptomic data for all stages and both sexes sequenced in the present study. First, all 185,706 contigs representing the combined transcriptome for H. contortus were run through BLAT [110] and filtered for full-length open reading frames (ORFs), ensuring the validity of splice sites. These ORFs were then used to train the de novo gene prediction programs SNAP [91] and AUGUSTUS [92] by producing a hidden Markov model (HMM) for each program. The same ORFs were also given (as an expressed sequence tag (EST) input) to MAKER2 [111] to provide evidence for predicted genes. In addition, all raw reads representing the combined H. contortus transcriptome were run through the programs TopHat and Cufflinks to provide additional information on transcripts and on exon-intron boundaries in the form of a Generic Feature Format (GFF) file.
HMMs, the EST input, and the GFF file were subjected to analysis using MAKER2 to provide a consensus set of 27,782 genes for H. contortus. Genes inferred to encode peptides of 30 or more amino acids in length were preserved, resulting in the prediction of a total of 27,135 genes. To account for the genes in DNA repeat regions, identified by RepeatMasker, we removed genes (n = 1,028) that overlapped these regions by at least one nucleotide and did not have a similarity match (BLASTp; e-value ≤10 -5 ) [112] with genes of C. elegans. Following filtering of the predicted genes by Annotation Edit Distance (AED <0.4) [113], the final set was inferred to contain 23,610 protein-coding genes. The predicted genes were represented by amino acid and cDNA sequences.
Each coding gene was assessed against the known KEGG Orthology (KO) term BLAST hits. These BLAST hits were clustered to a known protein family using the KEGG-BRITE hierarchy in a custom script. ES proteins were predicted using SignalP (version 4.0) [128] and TMHMM (version 2.0c) [122,129,130] and by BLASTp homology searching of the validated Signal Peptide Database [131] and of an ES database containing published proteomic data for A. suum [14], B. malayi [15]. C. elegans [116], and T. spiralis [16]. In the final annotation, proteins inferred from genes were classified based on a homology match (e-value cut-off, ≤10 -5 ) to: (i) a curated, specialist protein database, followed by (ii) the KEGG database, followed by (iii) the Swiss-Prot database, followed by (iv) the annotated gene set for a model organism, including C. elegans, followed by (v) a recognized, conserved protein domain based on InterProScan analysis. Any inferred proteins lacking a match (e-value cut-off, ≤10 -5 ) in at least one of these analyses were designated hypothetical proteins. The final annotated protein-coding gene set for H. contortus is available for download at WormBase [116] in nucleotide and amino acid formats.
Each set of the decontaminated and quality-filtered paired-end RNA-seq data was mapped to the set of cDNAs using Burrows-Wheeler Aligner software [135]. The numbers of mapped reads per individual gene were extracted using the program SAMtools [136]. The resultant read counts per developmental stage were used as input data for edgeR. Initially, the levels of differential transcription data were calculated by pairwise comparison of stages in the life cycle of H. contortus (for example, egg versus L1; L1 versus L2; L2 versus L3; L3 versus L4; L3 versus adult; L4 versus adult) and of sexes (L4 female versus L4 male; adult female versus adult male). Using edgeR dispersion factor zero, the genes were considered differentially transcribed if the logarithmic change in fold change (FC) compared with the normalized read count data was greater than or equal to 2 and the P-value was less than or equal to 10 -5 . The levels of differential transcription data were then calculated by pairwise comparisons between all free-living (egg, L1, L2. and L3) and parasitic (L4 and adult) stages. The genes were considered differentially transcribed, using edgeR-calculated common and genewise dispersion factors, if the FC compared with the normalized read count data was greater than or equal to 2 and the false discovery rate (FDR) was less than or equal to 0.05. To identify abundant sex-enriched genes in adult H. contortus, more stringent criteria (FC ≥ 4; FDR ≤ 0.05) were applied. The resultant differentially transcribed genes were subjected to genetic interaction network analysis [33,34], based on the pre-calculated, weighed interactions among C. elegans genes. Hubs with at least 10 interactions (degree ≥ 10) among different genes were considered significant.

Protein homology
Homologs between H. contortus and A. suum, B. malayi, C. elegans, and T. spiralis were inferred by comparison of all proteins by BLASTp (e-value ≤ 10 -5 ), pairing proteins based on reciprocal best hits, and inferring homologous groups from pairs using a custom script.

Prediction of essentiality, chokepoints, and drug targets
Essentiality was inferred by filtering C. elegans homologs (BLASTp; e-value ≤ 10 -5 ) [112] representing lethal phenotype in RNAi experiments listed in WormBase release WS222 [116]. The metabolic chokepoints were predicted from essential genes with a unique match to the combined identifier of KEGG pathway and KO group. KEGG pathways and KO groups were inferred from the KEGG database (BLASTp; e-value ≤10 -5 ). The molecules in metabolic chokepoints that satisfied Lipinski's rule of five in ChEMBL were identified from matches with target molecules (BLASTp; e-value ≤10 -30 ) in the ChEMBL database.
Additional bioinformatic and data analyses, and use of software for document preparation Data analysis was conducted in a Unix environment or Microsoft Excel 2007 using standard commands. Bioinformatic scripts required to facilitate data analysis were designed using mainly the Python 2.6 scripting language.

Data availability
The genomic sequence and gene predictions for H. contortus are available in WormBase. The genome sequence has also been deposited at DDBJ/EMBL/GenBank (accession number AUUS00000000), and genomic and RNA-seq reads in the NCBI short read archive (SRA; accession numbers SRP027504 and SRP026668, respectively).

Additional material
Additional file 1: Supplementary tables S1-S18. Authors' contributions RBG, BEC, and AJ produced H. contortus in sheep and isolated genomic DNA; BEC and NDY isolated RNA; BAW constructed RNA-seq libraries; IA optimized Illumina protocols for paired-end libraries. CTB, ACH, and JP devised khmer filtering; CTB and EMS ran it on quality-filtered genomic reads. EMS decontaminated all reads in silico. EMS and PKK performed genomic assembly and annotation; PKK, BEC, NDY, ARJ, and RSH performed transcriptomic analyses. TRG estimated the genome size of H. contortus by Feulgen image analysis densitometry. RBG, EMS, PKK, and PWS wrote the manuscript, with crucial contributions from NDY, ARJ, and inputs from XQZ, PRB, and other co-authors. RBG and PWS managed the project. All authors read and approved the final manuscript.