Analysis and functional classification of transcripts from the nematode Meloidogyne incognita

As an entrée to characterizing plant parasitic nematode genomes, 5,700 expressed sequence tags (ESTs) from the infective second-stage larvae (L2) of the root-knot nematode Meloidogyne incognita have been analyzed. In addition to identifying putative nematode-specific and Tylenchida-specific genes, sequencing revealed previously uncharacterized horizontal gene transfer candidates in Meloidogyne with high identity to rhizobacterial genes.

cycle. Embryos develop in a proteinaceous matrix extruded by the adult female, and hatch as second-stage larvae (L2) that move through the soil and invade the plant root. Within the root, the worm establishes a feeding site and undergoes three additional molts to become an adult. M. incognita is a mitotic parthenogenetic species. Males develop but appear to play no role in reproduction [3]. Females swell to a pear shape and are incapable of moving once committing to a root feeding site.
The Meloidogyne L2 larvae, the infective stage where the worm is away from the host plant (also referred to as second-stage juvenile in the literature), is more accessible than the rest of the life cycle, and is an interesting stage biologically with the worm completing multiple steps required for survival. On hatching from the eggshell, L2 worms are able to locate and migrate towards a potential host plant, penetrate the root behind its tip in the zone of elongation, and migrate intercellularly through the vascular cylinder by separating cells at the middle lamella [4]. The migration is enabled by a combination of stylet protrusion (mechanical force) and secretion of cell-wall-degrading enzymes from specialized glands [5][6][7][8]. Upon completion of migration, secretions from the nematode's glands, and potentially other cues, induce root cells to alter their development and gene expression, undergoing abnormal growth and repeated endomitotic rounds of replication to form a feeding site made up of giant cells [9,10]. The L2 feeds from the giant cells for 10-12 days, then ceases feeding and molts three times over the next two days to form the adult. L2 undergo significant change following establishment of the feeding site, including swelling of the body and a switch in gland activity from subventral to dorsal dominance [11].
Until recent years, molecular characterization of Meloidogyne genes has been limited [12,13], particularly because the species' obligate parasitic life cycle makes studies difficult. Both basic understanding of root-knot nematode biology and applied research toward new means of nematode control are now beginning to benefit from the rapid identification of transcribed genes in the species. The generation of expressed sequence tags (ESTs) by single-pass random sequencing of cDNA libraries is a powerful tool for rapid gene transcript identification in metazoans [14][15][16][17] including parasitic nematodes of humans and animals [18][19][20][21][22][23]. High-throughput projects on two dozen nematode species have now brought the total number of publicly available roundworm ESTs to nearly 400,000, with half the sequences coming from parasites [24][25][26][27]. As a part of these efforts, EST sequencing from plant parasitic nematodes is in progress [28] and pilot EST datasets from the root-knot nematode M. incognita and the cyst nematodes Globodera rostochiensis and G. pallida second-stage larvae have recently been analyzed [29,30].
Important to the characterization and understanding of these sequences is the creation and implementation of bioinformatics approaches (such as clustering, functional classification, similarity analysis) that can be applied uniformly across the ever-increasing multiple nematode datasets. We present here an analysis of 5,713 ESTs from M. incognita L2 including creation of NemaGene clusters to reduce sequence redundancy, identification of abundant transcripts, and functional classification of gene products based on assignments to InterPro domains, the Gene Ontology hierarchy, and KEGG biochemical pathways. Building on the availability of the complete genome sequence, gene homologs of the free-living nematode Caenorhabditis elegans [31] were identified for M. incognita clusters and correlated with known RNA interference (RNAi) phenotypes. Genes specific to plant parasitic nematodes (Tylenchida species) as well as prokaryotic-like horizontal gene transfer candidates were also examined.

Results and discussion
As part of a larger effort to examine expressed gene sequences from parasitic nematodes, we have generated and submitted to GenBank's EST database 5,713 ESTs from a M. incognita L2 library. Sequences, which include both 5 and 3 reads, averaged 481 nucleotides, resulting in 2.82 million submitted nucleotides. Here we present a first analysis applying semi-automated bioinformatics tools to genome data from a plant parasitic nematode, thereby laying the groundwork for more extensive analyses.

NemaGene cluster analysis
To reduce data redundancy, improve base accuracy and transcript length, and determine gene representation within the library, ESTs from the M. incognita L2 library were grouped by sequence identity into contigs and clusters by a method using Phrap and BLAST. 'Contig' member ESTs appear to derive from identical transcripts while 'cluster' members may derive from the same gene yet represent different transcript splice isoforms (that is, ESTs form contigs, contigs form clusters). Beginning with 5,713 traces, automated screens and manual inspection of misassembled contigs resulted in the elimination of 52 ESTs as potential chimeric sequences. The remaining 5,661 ESTs formed 1,798 contigs and 1,625 clusters. Clusters varied in size from a single EST (723 cases) to 77 ESTs (1 case) ( Figure 1). By eliminating data redundancy during contig building, the total number of nucleotides used for further analysis was reduced from 2.82 million to 1.99 million. To a first approximation, this project generated sequence from as many as 1,625 genes, for a new gene discovery rate of 29%, with only 13% of ESTs being singletons. This number may, however, overestimate gene discovery as a single gene could be represented by multiple non-overlapping clusters. While library redundancy reduces the number of new genes discovered, 65% of clusters still have 10 or fewer EST members. Such redundancy is desirable to increase base accuracy and transcript length within contigs. Additionally, 122 clusters have multiple contig members, revealing potential splice isoforms. Contig building was successful in significantly increasing the length of assembled transcript sequences from 481 ± 108 nucleotides for submitted ESTs alone to 611 ± 174 nucleotides for multi-member contigs. The longest sequence also increased from 780 to 2,353 nucleotides. Sampling of another 5,661 ESTs from the same source is estimated to result in the discovery of only 329 new clusters, a new gene discovery rate of only 6% (ESTFreq, W. Gish, personal communication). Further sampling will therefore await library normalization. This same clustering methodology is being applied to ESTs from other nematode species [32].

Transcript abundance and highly represented genes
The 25 most abundant EST clusters accounted for 18% of all ESTs generated. A high level of representation in a cDNA library generally correlates with high transcript abundance in the original biological sample [33], although artifacts of library construction can result in selection for or against representation of some transcripts. Transcripts abundantly represented in the library include genes encoding cytoskeleton proteins (such as myosin, actin, UNC-87, troponin T) and proteins that carry out core eukaryotic energetic and metabolic processes (for example ADP/ATP translocase, lactate dehydrogenase) (Table 1). Sixty-four ESTs had significant homology to the putative fatty-acid-binding protein Sec-2, confirming the abundant expression of this gene reported in L2 cDNA libraries from M. incognita [29] and the cyst nematodes G. rostochiensis and G. pallida cDNA [30]. Sec-2 is secreted by plant-parasitic nematodes at relatively high levels [34]. Several abundantly expressed genes are also horizontal gene transfer candidates (see below).

Functional classification based on Gene Ontology assignments
To categorize transcripts by putative function, we have utilized the Gene Ontology (GO) classification scheme [35,36]. GO provides a dynamic controlled vocabulary and hierarchy that unifies descriptions of biological, cellular and molecular functions across genomes. InterProScan was used to match Meloidogyne clusters to characterized protein domains (5,875 entries) in the InterPro database [37]. Existing mappings of InterPro domains allowed placement of Meloidogyne clusters into the GO hierarchy, viewed locally with the  1  4  7  10  13  16  19  22  25  28  31  34  37  40  43  46  49  52  55  58  61  64  67  70  73  76  79 Cluster size   proteins (InterPro domain IPR001283) and showed homology to the genes vap-1 from H. glycines and Mi-msp-1 from M. incognita [40,41], both venom allergen antigen 5 family members with homologs in numerous nematodes including hookworms and C. elegans [42]. Categories that particularly contributed to the abundance of ligand-binding/carrier mappings for Meloidogyne included EF-hand calcium binding (22 clusters), RNA recognition motif (18 clusters), and a variety of ATP-binding domains (20 clusters). Differences in the distribution of GO mappings may be attributable to the more extensive stage representation available for the other species. Comparisons of relative expression levels for genes among different M. incognita stages will begin to be possible as EST collections from other life-cycle stages are generated and analyzed.

Functional classification based on KEGG analysis
As an alternative method of categorizing clusters by biochemical function, clusters were assigned to metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes database (KEGG [43]) using enzyme commission (EC) numbers as the basis for assignment. EC numbers were assigned to 258 clusters (16% of total), of which 176 (11%) had mappings to KEGG biochemical pathways (361 total and 212 unique mappings). Out of 82 possible metabolic pathways 56 were represented (Table 4). For a complete listing of KEGG mappings see Additional data files. Pathways well represented by the M. incognita clusters include: glycolysis/gluconeogenesis (10 enzymes represented), citrate cycle (7), fatty-acid metabolism and biosynthesis (11), pyrimidine metabolism (7), lysine degradation (8), arginine and proline metabolism (8) and tryptophan metabolism (8). Lysine, arginine and tryptophan are essential amino acids in C. elegans whereas proline is not [44]. Pathways not represented in Meloidogyne include alkaloid biosynthesis II and riboflavin (vitamin B2) metabolism. C. briggsae is incapable of synthesizing riboflavin [45] but C. elegans does appear to have a homolog of a riboflavin kinase (R10H10.6) and M. incognita may have at least one enzyme involved in riboflavin processing (see below).
Nematodes are believed to be unique among animals in utilizing the glyoxylate cycle to generate carbohydrates from the beta-oxidation of fatty acids (reviewed in [46]). The   [41]. These ESTs and their corresponding cDNA clones will be useful reagents for the further study of the glyoxylate pathway in different stages of the Meloidogyne life cycle. For instance, energy metabolism would be expected to change markedly upon plant invasion and intracellular migration toward the feeding site, and might include a decrease in expression of transcripts specific to the glyoxylate pathway. Figure 3 is a Venn diagram combining the results of BLAST searches versus three databases for the 79% (1,280/1,625) of M. incognita clusters which had matches to sequences from other species. Strikingly, in the majority of cases where homologies were found (740/1,280), matches were found in all three of the databases surveyed -C. elegans proteins, other nematode sequences, and non-nematode sequences. Gene products in this category are generally widely conserved across metazoans and many are involved in core biological processes. This category should continue to expand as additional complete genomes become available [50,51].

Distribution of BLAST database matches and homologs in C. elegans
The 20% of contigs (353) that had no homology may contain novel or diverged amino-acid coding sequences that are specific to Meloidogyne species or even to M. incognita only. Alternatively, clusters which containing mostly 3 or 5 untranslated regions (UTRs) would lack BLASTX homology because they are non-coding or contain too short a coding sequence to result in significant homology. To examine this latter possibility contig consensus sequences with and without BLASTX homology were examined to determine their longest open reading frame (ORF). The distribution of ORF sizes indicates that clusters without homology contain two populations; one population of novel protein-coding sequences with a similar distribution of ORF sizes to that found in sequences with homology, and a second population     In contrast to these findings for M. incognita where most clusters had homology, BLAST searches with EST clusters from the filarial nematode B. malayi showed far fewer database matches with the same e-value cut-off of 10 -5 [52] -57% versus 79%. Part of this difference is due to the use of more extensive databases in the M. incognita search. For instance, the Meloidogyne search included all dbEST sequences in the 'other nematode' set, resulting in matches for 61% of all clusters, whereas the Brugia search used only protein sequences in GenBank and saw matches in only around 12% of cases. However, even matches in C. elegans were fewer for B. malayi (50% versus 67%), where nearly identical databases were used. Brugia, Meloidogyne and Caenorhabditis represent three separate major nematode clades (III, IV and V, respectively) [53]. Possible explanations for the discrepancy in matches are that the Brugia clusters contain a large fraction of non-coding sequences (that is, 5 and 3 UTR, unspliced introns) or have undergone more rapid molecular evolution and diversification. Alternatively, since the Brugia ESTs derive from 12 different libraries they may represent rarer transcripts than are contained in the M. incognita collection.
A correlation between stage of expression and molecular conservation has been observed in C. elegans [54].
As expected, the C. elegans genome [31] was the best source of information for interpreting M. incognita sequences with 85% of all clusters with matches showing homology to a C. elegans gene product ( Figure 3). Table 5 presents the 15  gene products with the highest level of conservation (e-240 to e-115) between M. incognita and C. elegans; these include gene products involved in cell structure (for example, actin, myosin), protein biosynthesis (for example, ribosomal proteins) and glycolysis (for example, lactate dehydrogenase, enolase). Representation of these clusters in the M. incognita L2 EST collection varied from common (77 ESTs Nematodes process many mRNAs by trans-splicing to SL1 and other splice leader sequences [57,58] and in C. elegans use of different splice leaders is tied to genome organization in operons [59]. SL1 is the predominant nematode splice leader and is highly conserved across many species. Use of SL1 by transcripts is estimated at 70% in C. elegans [60], more than 80% in Ascaris lumbricoides [61], and approximately 60% in G. rostochiensis (Ling Qin, personal communication). SL1 has previously been observed in M. incognita [12], although genes with non trans-spliced 5 ends have also been cloned [5,6]. Only 33 of our M. incognita contigs have an SL1 sequence at their 5 end. This limited detection of SL1 is not surprising as both the poor processivity of reverse transcriptase and the positioning of the vector sequence primer near the beginning of the insert result in low representation of the initial 5 nucleotides of a transcript among EST collections. As an alternative method of determining which M. incognita genes may have an SL1 splice leader, contigs were compared by BLASTN to our recently sequenced ESTs from a M. arenaria egg library produced by PCR with an SL1 primer sequence. Of the M. incognita contigs 188 had high-level nucleotide identity (better than 1e-30) to this collection of SL1-containing Meloidogyne genes. With ESTs now available in our collection from four Meloidogyne and numerous other SL1-PCR cDNA libraries [32], it should be possible to address whether or not SL1-splicing of individual genes is conserved across nematode species.

Comparison to C. elegans genes with known RNAi phenotypes
The technique of RNAi, whereby the introduction of a sequence-specific double-stranded RNA leads to degradation of matching mRNAs [62], has allowed the systematic surveying of thousands of C. elegans genes for phenotypes following transient gene knockout [63][64][65]. Such information is potentially transferable to understanding which genes have crucial roles in parasitic nematodes where highthroughput RNAi is not yet possible. A list of 7,212 C. elegans RNAi experiments surveying 4,786 genes was compared to the list of all M. incognita clusters with significant homology to C. elegans proteins. Using the criterion that the C. elegans gene was the best match available for one of the Meloidogyne clusters and RNAi experimental information was available, 539 genes were revealed. A specific phenotype by RNAi was apparent for 221 (41%) of these genes, whereas 318 (59%) remained wild type (see Additional data files for the complete list of C. elegans RNAi phenotypes for genes with M. incognita homologs). By comparison, RNAi surveys of all predicted genes on a C. elegans chromosome have found a smaller percentage of genes with phenotypes: 14% for chromosome I [63] and 13% for chromosome III [64]. Surveys of expressed genes reveal an intermediate level of 27% with phenotypes [65]. Further, selecting for C. elegans genes with expressed Meloidogyne homologs led to enrichment for genes with severe phenotypes by RNAi such as embryonic lethality or sterility as compared to the overall dataset ( Figure 5) (For a complete tally of all observed phenotypes see Additional data files). A correlation between sequence conservation and severe phenotype by RNAi had previously been shown by comparison of C. elegans to genomes from the distant phyla Saccharomyces, Drosophila and human [63,64]. Here we show a similar trend following detection of homology to expressed genes in other nematode species. Applying RNAi techniques directly to parasitic nematodes is challenging owing to the organisms' generally longer and more complex life cycles, including the requirement for passage through a host organism. Progress has been made recently in assaying RNAi effects in both plant [66] and animal [67] parasitic nematodes. Further success may allow for a more high-throughput examination of phenotypes resulting from transient gene knockout in parasites.

Tylenchida-specific genes and horizontal gene transfer candidates
Fifty-three M. incognita clusters showed homology to sequences from other nematode species yet lacked either C. elegans or non-nematode homologs. Twenty of these clusters showed conservation only to gene products from other Tylenchida species. MI00244.cl, for example, had homology to 47 ESTs in our collection from other Tylenchida species including root-knot nematodes M. javanica, M. hapla and M. arenaria, cyst nematodes H. glycines and G. rostochiensis, and the lesion nematode Pratylenchus penetrans with E-values from 7e-78 to 3e-05. The best homology to any C. elegans protein was an extremely weak match (E-value = 0.017) to hypothetical protein M01H9.3b. Genes in this collection may be rapidly evolving so that homologs are only detected in closely related species. Alternatively, genes may be special adaptations to plant parasitism. No annotation is available for any of these genes, but alignments with sequences from related species can define domains for further characterization.
MI01045.cl, the seventh largest Meloidogyne EST cluster, is a new horizontal gene transfer candidate with homology to nodL acetyltransferase from Rhizobium leguminosarum (1e-53). Nod factor is responsible for the induction of nodules in nitrogen-fixing plants and nodL has an essential role in Nod factor biosynthesis [73]. Experimental demonstration of a trans-spliced leader on the Meloidogyne nodL mRNA and   To identify further horizontal gene transfer candidates from the M. incognita EST clusters, the subset of clusters with homology to sequences in other Tylenchida and in nonnematodes but not in non-Tylenchida nematodes were examined. In addition to those sequences already characterized, four additional clusters of interest were identified. MI00109.cl shows homology to a group of hypothetical proteins from alpha-proteobacteria: Sinorhizobium meliloti NP_386252 (3e-44); Novosphingobium aromaticivorans ZP_00095448 (3e-38); Mesorhizobium loti NP_107072 (5e-37). The finding of multiple Tylenchida genes with close homologs in rhizobacteria suggests the possibility of horizontal transfer of cassettes of genes or multiple transfer events between nitrogen-fixing soil bacteria and plant parasitic nematodes. MI01406.cl and MI00267.cl show homology to two hypothetical proteins from the Actinomycetales -Amycolatopsis mediterranei CAC42207 (5e-29) and Streptomyces lavendulae AAD32751 (2e-24). Providing some clue to function, the clusters as well as the hypothetical proteins are more distant homologs (1e-05 to 1e-08) of a putative riboflavin aldehyde-forming enzyme from Agaricus bisporus, CAB85691 (D.C. Eastwood, GenBank direct submission, 2000), an annotation based on homology (5e-05) to the characterized enzyme from Schizophyllum commune [74]. A weak but common motif between all of the proteins is discernible. The only previous analysis of root knot nematodes ESTs [29] used 914 ESTs from M. incognita L2 without clustering and with non-automated assignment of genes to categories. The two datasets share some overlap, with 35% (316/914) of the previously analyzed ESTs finding matches in 16% (261/1,625) of the NemaGene clusters analyzed in this paper, many with strong homology (< 1e-40). This overlap was less than expected given the redundancy of the cDNA library analyzed here, at nearly 6,000 ESTs, and suggest that: first, libraries made by different methods are likely to result in different representation from an mRNA pool (either different genes or other portions of the same genes as a result of different 5 processivity); and second, that M. incognita L2 are likely to have a substantial number of unsampled messages awaiting generation of new libraries or library normalization. The semi-automated clustering, sequence homology searching and scripted assignment of sequences to functional categories presented here is a scalable approach to analysis that can be applied to larger datasets.

Conclusions
In addition to applying the approaches presented here to larger and more diverse datasets, further topics in Meloidogyne genome analysis have yet to be explored. The availability of ESTs representing different developmental stages of Meloidogyne will allow an examination of changes in gene representation between stages, and in turn an understanding of the relative importance of various metabolic processes at different stages of development. EST sequences and their corresponding clones can be further used to study relative expression level between stages and conditions using microarrays [75] and amplified fragment length polymorphism (AFLP) approaches [76]. Contig sequences within clusters can also be compared directly for evidence of alternative splicing, another feature which might correlate with developmental stage. Other topics where bioinformatics analysis of available ESTs can improve current knowledge of Meloidogyne molecular biology include the identification of secreted and transmembrane proteins through secretion signal sequence detection [77], the C. elegans all C. elegans with M. incognita homology % creation of a more accurate codon usage/bias and aminoacid usage tables [78], the identification of conserved genes and pathways used in dauer/infective stages across nematode species [79], the definition and study of nematodespecific domains [55], and improved phylogenies based on sampling from multiple genes [53].
While ESTs do not provide information on genome organization in Meloidogyne (no genome sequence or physical map is yet available), they can shed light on the organization of the C. elegans genome. For instance, C. elegans autosomes are organized into central regions dense with predicted genes, highly expressed genes and known mutants, whereas the chromosome arms contain more repetitive sequences and have a higher meiotic recombination rate [31,80]. By using the expanding collection of ESTs from nematodes at various evolutionary distances from C. elegans, the hypothesis that genes on the autosome arms are more rapidly evolving can be tested more systematically. Mapping of ESTs from other nematode species can also detect genes contained in the C. elegans genome yet not previously recognized, and therefore missing from Wormpep, as well as recognized genes where not all exons have been correctly predicted.
In conclusion, the 5,713 ESTs analyzed here in 1,625 clusters probably represent 6-10% of the genes in the M. incognita genome. This initial study, which will be expanded as further sequences are generated, demonstrates that EST generation is an effective method for the discovery of the new genes in plant parasitic nematodes. Further, functional categorization and comparison to known sequences allows the identification of important biological processes at specific developmental stages as well as unusual sequences, such as horizontal gene transfer candidates.

Source material and library production
To obtain M. incognita L2 larvae, a population of nematodes maintained on Rutgers tomato were harvested, eggs were isolated and hatched by standard protocols [81]. Briefly, galled roots were removed from sandy soil, rinsed, and shaken in 15% bleach for 3 min to break roots and free egg masses. Contents were filtered with a large excess of water through a No. 200 sieve to remove root and soil fragments, and a No. 500 sieve to retain nematode eggs. Decanted eggs in small volume were applied above a 40% sucrose solution in a 50 ml conical tube and spun at 2,000 rpm for 10 min. Eggs banded at the sucrose/water interface and were removed by pipette. Following rinsing, sucrose banding was repeated. Harvested eggs were hatched over 4 days on top of a moist filter paper barrier (3 Crown Shopmaster heavy-duty wipes). Hatched larvae migrated through the paper and were collected in a water-filled petri dish below. By microscopic examination, collected worms were predominantly live moving L2, but rare dead L2 and eggs could be found.
Total RNA was isolated from collected L2 by the Trizol method (Life Technologies, Gaithersburg, MD) with a yield of 380 g from around 1 ml of packed L2 worms. Poly(A) + RNA was isolated from total RNA using the Promega Isolation System II (Promega, Madison, WI) following the manufacturer's instructions with a yield of 4.04 g. The cDNA library (named Bird_Rao_Meloidogyne_incognita_J2) was constructed using the Zap Express cDNA Synthesis Kit and Gigapack III Gold Cloning Kit, 200403 (Stratagene, Cedar Creek, TX). Inserts were directionally cloned between an EcoRI site (5) and a XhoI site (3); however, sequencing indicates that ~22% of clones are in reverse orientation. The non-directionality of the library does not interfere with either clustering or homology detection as both orientations are examined. The titer of the non-amplified phage library was 70,000 recombinants. In preparation for high-throughput sequencing the pBK-CMV phagemid was excised in bulk from the Zap Express phage using the ExAssist Interference-Resistant Helper Phage protocol 211203 (Stratagene). Resulting plasmids were replicated in the helper phage-resistant host cell XLOLR with kanamycin selection. It is expected that the majority of messages in this whole-animal library derive from the tissues that make up most of the mass of the L2 animal including hypodermis/cuticle, intestine, muscle, esophageal and rectal gland, and esophagus/pharnyx [82].

Sequence production and dbEST submission
Clone processing and sequencing was performed as in Hiller et al. [83] with some modifications. Single bacterial colonies from the plasmid library were picked from agar trays into 384-well plates containing media, kanamycin, and 7% glycerol using a Q-bot robotic colony picker (Genetix, Christchurch, UK). Plates were incubated overnight at 37°C and stored at -80°C. To prepare template plasmid DNA from each sample, bacterial inoculates were transferred from 384well storage to 96-well growth blocks containing 1 ml medium per sample and grown overnight. All subsequent sample and reagent transfers were done using a stationary 96-channel Hydra (Robbins Scientific, Sunnyvale, CA). DNA isolation was performed using a fast and inexpensive microwave-based protocol [84]. Sequencing reactions using the T3 (5) primer employed BigDye terminator chemistry (Applied Biosystems, Foster City, CA) and the cycle sequencing reactions were performed with 96 x 4-block thermocyclers (MJ Research, Waltham, MA). Samples were loaded on ABI377 (96-lane slab gel) sequencers (Applied Biosystems).
Following gel image analysis and DNA sequence extraction, sequence data were processed in an automated pipeline to: assess EST quality; trim flanking vector sequences; mask repetitive elements; remove contaminated ESTs; identify similarities by BLAST; identify cloning artifacts; and determine which portion of the EST to submit [83]. The resulting sequences were annotated with similarity information and sequence quality information and submitted to dbEST. Clones are named for their 96-well plate identity and position during processing (for instance ra40e04.y1). Names are mapped to stored clone location in 384-well plate format. Clones can be ordered at [85]. From 7,818 attempted reads, 5,854 sequences (75%) passed quality and contamination filters and were submitted to dbEST [86]. Most submissions (5,713)  To estimate the number of 5 versus 3 reads, we examined the 4,198 ESTs with detectable homology on either sense or antisense strands at time of submission (BLASTX search versus the SWIR non-redundant protein database, Sanger Centre). Most ESTs (78%) showed translated amino-acid homology consistent with sequencing from the 5 end of the transcript, while 22% showed homology consistent with 3 end sequencing. The mean submitted read length was 481 nucleotides with a standard deviation of 108. Longest and shorted submitted reads were 49 and 780, respectively. Since our submission filter includes a quality cut-off at the distal end of the read (Phred Score < 12 [87,88]), additional sequence can sometimes be obtained by direct examination of the sequencing trace available at [89].

Clustering for NemaGene Meloidogyne incognita v 2.0
Clustering was performed by first building 'contigs' of ESTs with identical or nearly identical overlapping sequence and second, by bringing together related contigs to form 'clusters'. Contig member ESTs should all derive from identical transcripts whereas cluster members might derive from the same gene yet represent different transcript splice isoforms or transcripts from multigene families with extremely high sequence identity. The raw traces for submitted ESTs were base-called using Phred [87] and assembled to form contigs using Phrap (P. Green, personal communication). Although Phrap is a program intended for genome assembly, it has been applied previously to ESTs with modifications [90]. To determine initial assembly quality, the largest contigs were inspected using the assembly viewer Consed [91]. Misassemblies bringing unrelated ESTs together into giant contigs usually resulted from the alignment of long poly(A) tails. To eliminate these assemblies of otherwise dissimilar ESTs, Phrap parameters (forcelevel 1, minmatch 20 and minscore 100) were adjusted and Phrap was rerun.
Once acceptable assembly parameters were obtained, Phrap was run to generate a first-draft assembly. Contigs with only one member EST (singletons) were removed from consideration until the trimming and cluster building stage. All contigs with more than three member ESTs was screened for misassemblies using Consed tools and newly written scripts. Misassemblies were recognized by: regions of high quality unaligned sequence; multiple runs of poly(A) and/or poly(T) (at least 15 nucleotides with no more than a one non-A/T base); internal poly(A) and/or poly(T) runs (> 50 nucleotides from either end of a contig and Ն 15 or more nucleotides long with no more than one non-A/T base; internal stretches of low consensus quality (> 30 nucleotides from either end of a contig and Ն 50 nucleotides where 90% of the nucleotides had a consensus quality below Phred 20). Contigs flagged for possible misassembly were manually edited in Consed and potentially chimeric ESTs and other suspect ESTs were identified and removed from the pool of traces. Chimerism can result from multiple-insert cloning or mistracking of sequence gel lanes. The project was reassembled with Phrap and screened again as above. All contigs with more than three members were examined again in Consed to eliminate additional misassemblies not resolved by the initial screens. In total, around 450 contigs were examined manually and around 200 were edited. For each contig, a consensus sequence of all EST members was generated. Contigs (now including singleton EST contigs) were then trimmed to high quality and any internal consensus position with a calculated quality value below 12 was changed to an N (unknown base).
Following the creation of contigs by Phrap, the contig consensus sequences were compared using WU-BLASTN (G = 2 E = 1 v = 100 F = F) [92,93] and grouped on the basis of similarity to form clusters of related contigs. Contigs with overlaps of 100 bases or more with nucleotide-nucleotide identities of 93% or more were clustered together. For further analysis, new assemblies based on clusters were not formed; rather, each cluster retained all the consensus sequences of its contig members. NemaGene Meloidogyne incognita v 2.0 represents our second complete attempt at generating clusters for this species and is used as the basis for all subsequent analysis in this manuscript. Scripts have been written to allow the addition of new data while retaining the original contig and cluster naming scheme. Additional NemaGene versions of M. incognita will be built as additional ESTs become available for the species. A comparison of the NemaGene clustering approach to other EST clustering methods will be considered in a separate manuscript. NemaGene Meloidogyne incognita v 2.0 is available for searching at [94] and FTP at [95].

Sequence analysis
Following clustering, comparative analyses were performed using WU-BLASTX and WU-TBLASTX [92,93] with 1,798 contig consensus sequences (themselves grouped into 1,625 cluster groups) as queries versus multiple databases including SWIR v.21 (5/19/2000) non-redundant protein database and Wormpep v.54 C. elegans protein database (Wellcome Trust Sanger Institute, unpublished work), C. elegans mitochondrial protein sequences, and six internally constructed databases using intersections of data from the GenBank nucleotide database and dbEST [96]. These include: nemnoele (all nucleotide data from the phylum Nematoda with C. elegans removed); nemnoelenomi (nemnoele with M. incognita removed); nemnoelenomel (nemnoele with all Meloidogyne species removed); nemnoelenotyl (nemnoele with all Tylenchida species removed); yestylnomel (all Tylenchida species except Meloidogyne); mj (only M. javanica sequences). An additional database, nrnonem, is an amino-acid database of all non-nematode proteins derived from SWIR v.21. WU-BLASTX (translated nucleotide query versus protein database) parameters were S = 100 M = PAM120 V =0 W = 4 T = 17. WU-TBLASTX (translated nucleotide query versus translated nucleotide database, each in all six reading frames) parameters were Q = 10 R = 2 gapw = 10. Homologies were reported for e-value scores of 1e-5 and better. By creating intersections of various database search results, contigs/clusters could be organized by their distribution of homologies (for example, clusters which have M. javanica matches but not C. elegans matches). Data analysis was performed in a Unix environment using Perl and Bourne shell scripts. The program ESTFreq (W. Gish, personal communication) was used to estimate novel sequences expected from a second sampling and the program Translate (S. Eddy, personal communication) was used to translate nucleotide consensus sequences for ORF analysis.

Functional assignments
To assign putative functions to clusters, the integrated protein domain recognition program InterProScan [97,98] was run locally to search translated contig consensus sequences versus all InterPro protein domains (as of 2 April 2002) [99]. The Prosite, Prints, Pfam, ProDom, and Smart search components of InterProScan were used with default parameters. The GO categorization scheme (go_200205assocdb.sql) of classification by biological process, cellular component, and molecular function was used to classify clusters based on the existing mappings of InterPro domains to the GO hierarchy [36]. Mappings were stored in a local MySQL database and displayed using the AmiGO browser (16 May 2002) [100] (M. incognita mappings at [101]).
As an alternative means of assigning function to clusters, clusters were also assigned to metabolic pathways using KEGG [102,103]. Assignments were made by requiring that the highest-scoring BLAST match in SWIR v.21 have an assigned enzyme commission (EC) number [104]. EC number mappings to KEGG pathways were then used to putatively assign clusters into biochemical pathways. Nonspecific pathway mappings (for example, kinases, EC 2.7.1.-) were eliminated, as were misleading pathway assignments (for example, plant carbon fixation, KEGG 2.3, where the assigned protein had only a peripheral 'feed-in' role in the pathway). Assignments were not made to KEGG regulatory pathways as proteins in these pathways lack EC numbers.

C. elegans homologs with RNAi phenotype
To identify cases where M. incognita and C. elegans share homologous genes which have been surveyed in C. elegans for knockout phenotype using RNAi, a list of all 7,212 available C. elegans RNAi experiments (5 May 2002) from WormBase [56] was compared to the list of all M. incognita clusters with significant homology matches to the C. elegans Wormpep v.54 protein database. Redundant RNAi experiments were removed to consolidate the WormBase list to 6,107 and experiments performed on the same gene with different phenotypic outcomes were consolidated later. For any given M. incognita cluster, only the best C. elegans matches, ranked by BLAST score, were considered.

Nematode origin of the cDNA sequences
To insure that sequences generated originate from M. incognita and are not contaminants, multiple steps purifying material and cross-checking sequence origin have been incorporated into the project: the starting material is purified and freed of plant material; poly(A) selection during library production is highly selective for eukaryotic transcripts, though it is possible for AT-rich prokaryotic transcripts to be cloned; analyzed sequences have been filtered for prokaryotic homology resulting in the removal of eight E. coli contaminants (0.14%), a typical background for cDNA cloning; 96% of the clusters with detectable homology have nematode homologs (1,227/1,280), 17% have only nematode homology, and in the vast majority of cases, higher conservation is seen to a nematode sequence than any non-nematode sequence; additional confirmation of nematode origin comes from the presence of an SL1 trans-spliced leader sequence on some genes; all sequences with strong aminoacid homology to prokaryotic genes were closely examined and in no cases were the high levels of identity maintained at the nucleotide level (as would be the case with a contaminating sequence). While it can be stated with confidence that the vast majority of the sequence analyzed originates from M. incognita, we cannot rule out the possibility that the collection may include a very small number of contaminating sequences.

Additional data files
The following files are available as additional data with the online version of this article: a complete listing of gene ontology mappings for M. incognita clusters organized into (a) biological process, (b) cellular component, (c) molecular function (Additional data file 1); complete KEGG biochemical pathway mappings for M. incognita clusters (Additional data file 2); a complete list of C. elegans RNAi phenotypes for genes with M. incognita homologs (Additional data file 3); classification by RNAi phenotype of C. elegans genes with M. incognita homologs (Additional data file 4).