The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing

The database of Arabidopsis splicing related genes includes classification of genes encoding snRNAs and other splicing related proteins, together with information on gene structure, alternative splicing, gene duplications and phylogenetic relationships.


Rationale
Most eukaryotic genes contain introns that are spliced from the precursor mRNA (pre-mRNA). The correct interpretation of splicing signals is essential to generate authentic mature mRNAs that yield correct translation products. As an important post-transcriptional mechanism, gene function can be controlled at the level of splicing through the production of different mRNAs from a single pre-mRNA (reviewed in [1]). The general mechanism of splicing has been well studied in human and yeast systems and is largely conserved between these organisms. Plant RNA splicing mechanisms remain comparatively poorly understood, due in part to the lack of an in vitro plant splicing system. Although the splicing mechanisms in plants and animals appear to be similar overall, incorrect splicing of plant pre-mRNAs in mammalian systems (and vice versa) suggests that there are plant-specific characteristics, resulting from coevolution of splicing factors with the signals they recognize or from the requirement for additional splicing factors (reviewed in [2,3]).
Genome projects are accelerating research on splicing. For example, with the majority of splicing-related genes already known in human and budding yeast, these gene sequences were used to query the Drosophila and fission yeast genomes in an effort to identify potential homologs [4,5]. Most of the known genes were found to have homologs in both Drosophila and fission yeast. The availability of the near-complete genome of Arabidopsis thaliana [6] provides the foundation for the simultaneous study of all the genes involved in particular plant structures or physiological processes. For example, Barakat et al. [7] identified and mapped 249 genes encoding ribosomal proteins and analyzed gene number, chromosomal location, evolutionary history (including large-scale chromosomal duplications) and expression of those genes. Beisson et al. [8] catalogued all genes involved in acyl lipid metabolism. Wang et al. [9] surveyed more than 1,000 Arabidopsis protein kinases and computationally compared derived protein clusters with established gene families in budding yeast. Previous surveys of Arabidopsis gene families that contain some splicing-related genes include the DEAD box RNA helicase family [10] and RNA-recognition motif (RRM)-containing proteins [11]. At present, the Arabidopsis Information Resource (TAIR) links to more than 850 such expert-maintained collections of gene families [12].
Here we present the results of computational identification of potentially all or nearly all Arabidopsis genes involved in pre-mRNA splicing. Recent mass spectrometry analyses revealed more than 200 proteins associated with human spliceosomes ([13][14][15][16][17], reviewed in [18]). By extensive sequence comparisons using known plant and animal splicing-related proteins as queries, we have identified 74 small nuclear RNA (snRNA) genes and 395 protein-coding genes in the Arabidopsis genome that are likely to be homologs of animal splicing-related genes. About half of the genes occur in multiple copies in the genome and appear to have been derived both from chromosomal duplication events and from duplication of individual genes. All genes were classified into gene families, named and annotated with respect to their inferred gene structure, predicted protein domain structure and presumed function. The classification and analysis results are available as an integrated web resource, the database of Arabidopsis Splicing Related Genes (ASRG), which should facilitate genome-wide studies of pre-mRNA splicing in plants.

ASRG: a database of Arabidopsis splicing-related genes
Our up-to-date web-accessible database comprising the Arabidopsis splicing-related genes and associated information is available at [19]. The web pages display gene structure, alternative splicing patterns, protein domain structure and potential gene duplication origins in tabular format. Chromosomal locations and spliced alignment of cognate cDNAs and expressed sequence tags (ESTs) are viewable via links to the Arabidopsis genome database AtGDB [20], which also provides other associated information for these genes and links to other databases. Text-search functions are accessible from all the web pages. Sequence-analysis tools including BLAST [21] and CLUSTAL W [22] are integrated and facilitate comparison of splicing-related genes and proteins across various species.

Arabidopsis snRNA genes
A total of 15 major snRNA and two minor snRNA genes were previously identified experimentally in Arabidopsis [23][24][25][26][27][28]. These genes were used as queries to search the Arabidopsis genome for other snRNA genes. A total of 70 major snRNAs and three minor snRNAs were identified by this method. In addition, a single U4atac snRNA gene was identified by sequence motif search. We assigned tentative gene names and gene models as shown in Table 1, together with chromosome locations and similarity scores relative to a representative query sequence. The original names for known snRNAs were preserved, following the convention atUx.y, where x indicates the U snRNA type and y the gene number. Computationally identified snRNAs were named similarly, but with a hyphen instead of a period separating type from gene number (atUx-y). Putative pseudogenes were indicated with a 'p' following the gene name. Pseudogene status was assigned to gene models that show low sequence similarity to known genes, lack otherwise conserved transcription signals and cannot fold into the typical secondary structure. A recent experimental study of small non-messenger RNAs identified 14 tentative snRNAs in Arabidopsis by cDNA cloning ([29], GenBank accessions 22293580 to 22293592 and 22293600, Table 1). All these newly identified snRNAs were found in the set of our computationally predicted genes.
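The naming convention described above can be made concrete with a small parser. The following Python sketch is illustrative only and is not part of ASRG: the function name parse_snrna_name and the returned fields are hypothetical, and names that carry no gene number (for example atU12) or other letter suffixes are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional


class SnRNAName(NamedTuple):
    snrna_type: str      # e.g. "U2" or "U6atac"
    gene_number: int     # the y in atUx.y / atUx-y
    computational: bool  # True when a hyphen separates type and number
    pseudogene: bool     # True when the name ends in 'p'


# Hypothetical pattern for the convention in the text:
# atUx.y (previously known), atUx-y (computationally identified),
# optional trailing 'p' for putative pseudogenes.
_NAME = re.compile(r"^(?:at)?(U\d+(?:atac)?)([.-])(\d+)(p?)$")


def parse_snrna_name(name: str) -> Optional[SnRNAName]:
    """Split an snRNA gene name into its parts, or return None if it does not fit."""
    match = _NAME.match(name)
    if match is None:
        return None
    snrna_type, separator, number, suffix = match.groups()
    return SnRNAName(
        snrna_type=snrna_type,
        gene_number=int(number),
        computational=(separator == "-"),
        pseudogene=(suffix == "p"),
    )
```

For example, parse_snrna_name("atU4-11p") reports a computationally identified U4 pseudogene with gene number 11, whereas parse_snrna_name("atU2.3") reports a previously known U2 gene.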

Conservation of major snRNA genes
As shown in Table 1, each of five major snRNA genes (U1, U2, U4, U5 and U6) exists in more than 10 copies in the Arabidopsis genome. U2 snRNA has the largest copy number, with a total of 18 putative homologs identified. Both U1 and U5 snRNAs have 14 copies, U6 snRNA has 13 copies, and U4 snRNA has only 11 copies. Sequence comparisons within Arabidopsis snRNA gene families showed that the U6 snRNA genes are the most similar, and the U1 snRNA genes are the most divergent. Eight active U6 snRNA copies are more than 93% identical to each other in the genic region, whereas active U1 snRNAs are on average only 87% identical. The U2 and U4 snRNAs are also highly conserved within each type, with more than 92% identity among the active genes. Details about the individual snRNAs and the respective sequence alignments are displayed at [30].
Previous studies identified two conserved transcription signals in most major snRNA gene promoters: the USE (upstream sequence element, RTCCACATCG, where R is either A or G) and the TATA box [24][25][26][27]. All 14 U5 snRNAs have the USE and TATA box. Furthermore, their predicted secondary structures are similar to the known structure of their counterparts in human, indicating that all these genes are active and functional (structure data not shown; for a review of the structures of human snRNAs, see [31]). Similarly, we identified 17 U2, 10 U1, nine U4, and nine U6 snRNA genes as likely active genes, with a few additional genes more likely to be pseudogenes because of various deletions. U4-10 and U6-7 do not have the conserved USE in the promoter region, but their U4-U6 interaction regions (stem I and stem II) are fairly well conserved. U2-16 is also missing the USE but has a secondary structure similar to other U2 snRNAs. These genes may be active, but differences in promoter motifs suggest that their expression may be under different control compared with other snRNA homologs. The U2-17 snRNA has all conserved transcription signals, but 20 nucleotides are missing from its 3' end. The predicted secondary structure of U2-17 is similar to that of other U2 snRNAs, with a significantly shorter stem-loop in the 3' end as a result of the deletion. We are not sure if the U2-17 snRNA is functional, but the conserved transcription signals imply that it may be active.
Other conserved transcription signals were also identified in most active snRNAs, including the sequence element CAANTC (where N is either A, C, G or T) in U2 snRNAs (located at -6 to -1) [23], and the termination signal CAN 3-  [23,24,32]. The previously identified monocot-specific promoter element (MSP, RGCCCR, located upstream of USE) in U6.1 and U6.26 [33] is also found in five other U6 snRNA genes (U6.29, U6-2, U6-3, U6-4, U6-5). In all seven U6 snRNAs the consensus MSP sequence extends by two thymine nucleotides to RGCCCRTT. Although the MSP does not contribute significantly to U6 snRNA transcription initiation in Nicotiana plumbaginifolia protoplasts [33], the extended consensus may imply a role in gene expression regulation in Arabidopsis.
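A degenerate motif such as the USE (RTCCACATCG) or the extended MSP consensus (RGCCCRTT) can be located in a promoter sequence by expanding its IUPAC codes into a regular expression. The Python sketch below is a minimal illustration and not the procedure used in this study: the helper find_motif and the toy sequence are made up for the example, and only the IUPAC codes needed for the motifs in the text are included.

```python
import re
from typing import List

# Subset of IUPAC nucleotide codes used by the motifs in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine, as in RTCCACATCG
    "N": "[ACGT]",  # any base, as in CAANTC
}


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return 0-based start positions of every (possibly overlapping) motif match."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead lets overlapping occurrences be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


# Toy sequence (not a real Arabidopsis promoter): an MSP-like element
# placed upstream of a USE-like element.
toy_promoter = "AC" + "AGCCCGTT" + "CA" + "GTCCACATCG" + "TATAAATA"

use_hits = find_motif(toy_promoter, "RTCCACATCG")
msp_hits = find_motif(toy_promoter, "RGCCCRTT")
```

Because the function only reports positions, checking that an element lies at an appropriate distance from another signal would be a separate step that depends on spacing rules not given here.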

Low copy number of minor snRNA genes
The minor snRNAs function in the splicing of U12-type (AT-AC) introns. Four types of minor snRNAs, which correspond to four types of major snRNAs, exist in mammals. U11 is the analog of U1, U12 is the analog of U2, U4atac is the analog of U4, and U6atac is the analog of U6. The U5 snRNA seems to function in both the major and minor spliceosome [34]. Two minor snRNAs (atU12 and atU6atac) were experimentally identified in Arabidopsis [28]. Both have the conserved USE and TATA box in the promoter region. We identified another U6atac gene (atU6atac-2) by sequence mapping. This gene has a USE and a TATA box in the promoter region. The atU6atac-2 gene is more than 90% similar to atU6atac in both its 5' and 3' ends, with a 10-nucleotide deletion in the central region. The putative U4atac-U6atac interaction region in atU6atac-2 is 100% conserved with the interaction region previously identified in atU6atac [28,35].
U11 and U4atac have not been experimentally identified in Arabidopsis. BLAST searches using the human U11 and U4atac homologs as queries against the Arabidopsis genome failed to find any significant hits, indicating divergence of the minor snRNAs in plants and mammals. Using the strategy described below, we successfully identified a putative Arabidopsis U4atac gene. It is a single-copy gene containing all conserved functional domains. We also found a single candidate U11 snRNA gene (chromosome 5, from 17,492,101 to 17,492,600) that has the USE and TATA box in the promoter region. This gene also contains a putative binding site for Sm protein and a region that could pair with the 5' splice site of the U12-type intron.

Identification of an Arabidopsis U4atac snRNA gene
Like U4 snRNA and U6 snRNA, human U4atac and U6atac snRNAs interact with each other through base pairing [36]. The same interaction is expected to exist between the Arabidopsis homologs. Therefore, we deduced the tentative AtU4atac stem II sequence (CCCGTCTCTGTCAGAGGAG) from AtU6atac snRNA and searched for matching sequences in the Arabidopsis genome. Hit regions together with flanking regions 500 base-pairs (bp) upstream and 500 bp downstream were retrieved and screened for transcription signals (USE and TATA box). One sequence was identified that contains both the USE and TATA box in appropriate positions, as shown in Figure 1.
The tentative U4atac snRNA gene contains not only the stem II sequence, but also the stem I sequence that presumably base-pairs with U6atac snRNA stem I. Furthermore, a highly conserved Sm-protein-binding region exists at the 3' end. The predicted secondary structure is nearly identical to hsU4atac, with a relatively longer single-stranded region (data not shown). With the highly conserved transcriptional signals, functional domains and secondary structure, this candidate gene is likely to be a real U4atac snRNA homolog. We named it AtU4atac and assigned At4g16065 as its tentative gene model because it is located between gene models At4g16060 and At4g16070 on chromosome 4.
Table 1 footnotes: Chromosomal locations were determined by conducting BLAST searches against the Arabidopsis genome (Release 5.0). *The gene used for query in the BLAST search; † atU12 and atU6atac sequences, which were experimentally identified [28]. Their sequences were compiled manually from the cited paper. The GenBank gi numbers for the chromosome sequences used are as follows: chromosome 1, 42592260; chromosome 2, 30698031; chromosome 3, 30698537; chromosome 4, 30698542; chromosome 5, 30698605.
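The search step for the U4atac gene (find the stem II sequence in the genome, then retrieve each hit with 500 bp of flanking sequence on either side) can be sketched in Python as follows. This is an assumption-laden illustration rather than the authors' pipeline: the text does not say whether matching was exact or which strands were searched, so the sketch assumes exact matches on both strands, and the function names are hypothetical.

```python
from typing import List, Tuple

# Tentative AtU4atac stem II sequence quoted in the text.
STEM_II = "CCCGTCTCTGTCAGAGGAG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_hits_with_flanks(
    chromosome: str, query: str, flank: int = 500
) -> List[Tuple[str, int, str]]:
    """Find exact matches of `query` on both strands.

    Returns (strand, 0-based start on the forward strand, window) tuples,
    where window is the hit plus up to `flank` bases on each side, clipped
    at the chromosome ends and given in forward-strand orientation.
    """
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        start = chromosome.find(pattern)
        while start != -1:
            lo = max(0, start - flank)
            hi = min(len(chromosome), start + len(pattern) + flank)
            hits.append((strand, start, chromosome[lo:hi]))
            start = chromosome.find(pattern, start + 1)
    return hits
```

Each returned window could then be screened for the USE and TATA box, for instance with a degenerate-motif search; that screening step is not shown here.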

Arabidopsis splicing-related protein-coding genes
Most of the proteins involved in splicing in mammals and Drosophila are known [4,37,38]. In addition, recent proteomics studies revealed many novel proteins associated with human spliceosomes (reviewed in [18]). Using all these animal proteins as query sequences, we identified a total of 395 tentative homologs in Arabidopsis. Sequence-similarity scores and comparison of gene structure and protein domain structure were used to assign the genes to families. Each gene was assigned a tentative name based on the name of its respective animal homolog. Different homologs within a gene family were labeled by adding an Arabic number (1, 2, and so on) to the name. Close family members with similar gene structure were indicated by adding -a, -b, and -c to the name. The 395 genes were classified into five different categories according to the presumed function of their products. Ninety-one encode small nuclear ribonucleoprotein particle (snRNP) proteins, 109 encode splicing factors, and 60 encode potential splicing regulators. Details of EST evidence, alternative splicing patterns, duplication sources and domain structure of these genes are listed in Table 2. We also identified 84 Arabidopsis proteins corresponding to 54 human spliceosome-associated proteins. The remaining 51 genes encode proteins with domains or sequences similar to known splicing factors, but without enough similarity to allow unambiguous classification. These two categories are not discussed in detail here, but information about these genes is available at our ASRG site [39].

The majority of snRNP proteins are conserved in Arabidopsis
There are five snRNPs (U1, U2, U4, U5 and U6) involved in the formation of the major spliceosome, corresponding to five snRNAs. Five snRNPs (U1 snRNP, U2 snRNP, U5 snRNP, U4/U6 snRNP and U4.U6/U5 tri-snRNP) have been isolated experimentally in yeast or human [40][41][42][43][44][45]. Each snRNP contains the snRNA, a group of core proteins, and some snRNP-specific proteins. Most of these proteins are conserved in Arabidopsis. All U snRNPs except U6 snRNP contain seven common core proteins bound to snRNAs. These core proteins all have an Sm domain and have been called Sm proteins. The U6 snRNP contains seven LSM proteins ('like Sm' proteins).
Another LSM protein (LSM1) is not involved in binding snRNA (reviewed in [46]). As shown in Table 2, all Sm and LSM proteins have homologs in Arabidopsis, and eight of them are duplicated. It is likely that these genes existed as single copies in the ancestor of animals and plants, but duplicated within the plant lineage. Only one of the 24 genes (LSM5, At5g48870) has been characterized experimentally in Arabidopsis. The LSM5 gene was cloned from a mutant supersensitive to ABA (abscisic acid) and drought (SAD1 [47]). LSM5 is expressed at low levels in all tissues and its transcription is not altered by drought stress [47]. cDNA and EST evidence exists for all other core protein genes, indicating that all 24 genes are active.
There are 63 Arabidopsis proteins corresponding to the 35 snRNP-specific proteins used as queries in our genome mapping. Only a few of them have been characterized experimentally, including U1-70K, U1A and a tandem duplication pair of SAP130 [48][49][50]. U1-70K was reported as a single-copy essential gene. Expression of U1-70K antisense transcript under the APETALA3 promoter suppressed the development of sepals and petals [51]. We identified an additional homolog of U1-70K (At2g43370) and named it U1-70K2. The U1-70K2 protein showed 48% similarity to the U1-70K protein according to Blast2 results. Both genes retain the sixth intron in some transcripts, a situation which would produce truncated proteins [48]. Interestingly, we found that five of the 10 Arabidopsis U1 snRNP proteins, including the U1-70K-coding genes, may undergo alternative splicing.
Several genes in U2, U5, U4/U6 and U4.U6/U5 snRNPs, but none in U1 snRNP, occur in more than three copies in the Arabidopsis genome. The atSAP114 family has five members, including two that occur in tandem (atSAP114-1a and atSAP114-1b). Three members have EST/cDNA evidence (Table 2). Interestingly, the predicted atSAP114p (At4g15580) protein contains an RNase H domain at the amino-terminal end, and thus atSAP114p shares similarity to At5g06805, a gene annotated as encoding a non-LTR retroelement reverse transcriptase-like protein. It is likely that the atSAP114p gene is a pseudogene that originated by retroelement insertion. There are three copies of the gene for the tri-
The yeast U1 snRNP contains several specific proteins that are not present in mammalian U1 snRNPs [52]. As in mammals, Arabidopsis also lacks homologs of Prp42, a component of U1 snRNP in yeast [53]. However, Arabidopsis has two copies of the gene for Prp39, which are similar to Prp42. Furthermore, atPrp39a can produce a shorter protein isoform with a novel amino-terminal sequence by exon skipping. It is possible that the duplicates and alternative isoforms of plant U1 snRNP proteins are functional homologs of the yeast-specific proteins.
Several proteins specific to the minor spliceosome are also conserved in Arabidopsis. The human 18S U11/U12 snRNP contains several proteins found in U2 snRNP as well as seven novel proteins [14]. Four of the seven U11/U12-specific proteins (U11/U12-35K, 25K, 65K and 31K) are conserved in Arabidopsis, while the remaining three (59K, 48K and 20K) have no clear homologs. Interestingly, all four Arabidopsis genes are single copy in the genome, and three of them are apparently alternatively spliced (Table 2).

Splicing factors are slightly different in Arabidopsis than in other organisms
We divided the splicing factors into eight subgroups according to recent human spliceosome studies [13,14,16,18]: splice-site selection proteins; SR proteins; 17S U2 associated proteins; 35S U5 associated proteins; proteins specific to the B∆U1 complex; exon junction complex (EJC) proteins; second-step splicing factors and other known splicing factors. We focused our analysis on the first two subgroups because their functions in splicing are well established. A total of 109 proteins in Arabidopsis were identified, corresponding to 67 human queries from all eight subgroups. Most of the proteins are conserved among eukaryotes, but some human proteins have no obvious homologs in the Arabidopsis genome, and some novel splicing factors appear to exist in Arabidopsis. About 43% of genes encoding splicing factors are duplicated in the genome, whereas some proteins, such as SF1/BBP (branchpoint-binding protein, which facilitates U2 snRNP binding in fission yeast [54]) and cap-binding proteins (CBP20 and CBP80, possibly involved in cap proximal intron splicing [55]), derive from single-copy genes [56]. These single-copy gene products may work with all pre-mRNAs, including the ones with U12-type introns. Surprisingly, mutation of CBP80 (ABH1) is not lethal and is non-pleiotropic. The abh1 plants show ABA-hypersensitive closure of stomata and reduced wilting during drought [57].

Arabidopsis splicing-related proteins
Many splicing factors have been identified previously in Arabidopsis, including two U2AF65, two U2AF35, and 18 SR proteins [58][59][60][61][62][63][64][65][66][67]. The U2AF35-related protein atUrp, which could interact with U2AF65 and position RS-domain-containing splicing factors [68], is also present in the Arabidopsis genome. Although the Urp gene is expressed ubiquitously in human tissues, no ESTs from this gene were found in Arabidopsis. Three copies of PTB/hnRNP-I genes were identified in Arabidopsis. The PTB protein competes for the poly-pyrimidine tract with the U2AF large subunit, thus negatively regulating splicing [69].
We also identified a homolog related to atU2AF65 (At2g33440) and an additional SR protein (At2g46610). The additional SR protein belongs to the atRSp31 family and was named atRSp32 (At2g46610). It shares 70% identity and 78% similarity with atRSp31. The protein is 250 amino acids in length and contains two RRMs and some RS dipeptides in the carboxy-terminal region. The gene structure of atRSp32 is similar to that of atRSp31. Two other genes (atRSp40 and atRSp41) are in the same family and also have similar exon and intron sizes (see gene structure information at [70]). As in the previous classification of 18 SR proteins [61], the 19 SR proteins (including SR45) can be grouped into four large families of four to five members according to sequence similarity, gene structure and protein domain structure.
The atRSp31 family (atRSp31, atRSp32, atRSp40 and atRSp41) belongs to a novel plant SR family and has no clear animal ortholog. Other families include the SC35 (or SRrp/TASR2) family, SF2/ASF family, and the 9G8 family. Arabidopsis has a single copy of the SC35 ortholog and four SC35-like proteins (atSR33, atSCL30a, atSCL30 and atSCL28), which appear to have diverged significantly from SC35. It seems that this divergence predates the split of plants and animals because a similar SC35-like gene family exists in the human genome (SRrp35 and SRrp40). SRrp35 and SRrp40 were found to antagonize other SR proteins in vitro and function in 5' splice-site selection [71]. SF2/ASF has four copies (atSR1/SRp34, atSRp30, atSRp34a and atSRp34b) with similar gene structures and domains. Human 9G8 protein has five homologs in Arabidopsis, with three (atRSZp21, atRSZp22 and atRSZp22a) containing one CCHC-type zinc finger and two (atRSZ33, atRSZ32) containing two CCHC-type zinc fingers in addition to an RRM and an RS domain. Interestingly, several SR proteins (atRSZp21, atRSZp22, SR45 and SCL33) were found to interact with atU1-70K, and some SR proteins can interact with each other, thus forming a complicated interaction network to facilitate splice-site selection and spliceosome assembly [3,[61][62][63]. atSR45 was initially regarded as a novel plant SR protein [63], but by virtue of sequence-similarity scores it actually may be the ortholog of the human RNPS1 gene, which encodes an EJC protein.
Other human SR proteins (SRp20, SRp30c, SRp40, SRp54, SRp55 and SRp75) lack clear orthologs in Arabidopsis. We conclude that SR protein families evolved differently in animals and plants from three to four common ancestors, including SC35, SF2/ASF and 9G8/RSZ. The SRrp (SC35-like in plants) family may not be classical SR proteins but they play important roles in splice-site selection.
Proteins in other subgroups, such as 17S U2 snRNP-associated proteins, 35S U5 snRNP-associated proteins, and proteins specific to the B∆U1 complex, are also conserved in Arabidopsis. The B∆U1 complex is the spliceosome complex captured immediately before catalytic activation. Most proteins in the 35S U5 snRNP are absent in the B∆U1 complex but present in the active B complex, indicating the important roles of 35S U5 snRNP-associated proteins in spliceosome activation [13]. Conservation of these proteins in Arabidopsis suggests that the same pathway of spliceosome activation operates in plants. A subcomplex named Prp19 complex in 35S U5 snRNP has a critical role in spliceosome activation [13,72]. All proteins in the human Prp19 complex have homologs in Arabidopsis, including a chromosomal duplication pair of Prp19 genes and a single copy of the CDC5 gene. For the B∆U1 complex, six human genes have homologs, and five of them are single copy in Arabidopsis. Two genes (NPW38BP/SNP70 and p220(NPAT)) in the human B∆U1 complex have no apparent Arabidopsis homologs.
Arabidopsis also lacks an SMN protein complex. In human, the SMN protein (survival of motor neurons) can interact with a series of proteins including Gemin2, Gemin3 (a helicase), Gemin4, Gemin5 and Gemin6 to form an SMN complex, which has important roles in the biogenesis of snRNPs and the assembly of the spliceosome through direct interactions with Sm proteins and snRNA [73]. Although the SMN protein exists in the fission yeast genome (GenBank accession CAA91173), no SMN complex members can be identified in the Arabidopsis genome.

Splicing regulators are expanded in Arabidopsis
Splicing regulators are proteins that can either modify splicing factors or compete with splicing factors for their binding site. Important splicing regulators are hnRNP proteins and SR protein kinases. The exact role of phosphorylation of SR proteins in splicing is not yet clear, but SR protein kinases are well conserved and exist as multiple copies in Arabidopsis. A total of eight SR protein kinases were identified in Arabidopsis, including three Lammer/CLK kinases (AFC1, AFC2 and AFC3), two SRPK1 homologs, and three SRPK2 homologs. The three Lammer/CLK kinases were identified previously, and AFC2 was shown to phosphorylate SR protein in vitro [63,74]. Overexpression of tobacco AFC2 homolog PK12 in Arabidopsis changed the alternative splice patterns of several genes, including atSRp30, atSR1/atSRp34 and U1-70K [75], indicating that these SR proteins may function to modulate splicing in plants.
The heterogeneous nuclear ribonucleoproteins (hnRNPs) bind to splice sites and to binding sites for splicing factors on nascent pre-mRNAs, thus competing with splicing factors to negatively control splicing (reviewed in [76]). Humans have about 20 hnRNP proteins, many of which function in splicing. A total of 35 potential hnRNP proteins possibly related to splicing were found in Arabidopsis by sequence-similarity searches, including a superfamily of glycine-rich RNA-binding proteins. This family contains 21 members similar to human hnRNP A1 and hnRNP A2/B1. It can be further divided into two subfamilies. One includes eight proteins containing one RRM, and another has 13 members with two RRMs. Twelve of these proteins were identified previously, including AtGRP7, AtGRP8, UBA2a, UBA2b, UBA2c and AtRNPA/B1-6 [11,77,78]. AtGRP7 was found to be able to influence alternative splicing of its own transcripts as well as AtGRP8 transcripts [79]. UBA2 proteins can interact with UBP1 and UBA1 proteins, which have three RRMs and one RRM respectively, to recognize U-rich sequences in the 3' untranslated region (UTR) and stabilize mRNA [78]. Although the overexpression of UBA2 did not stimulate splicing of a reporter gene in tobacco protoplasts [78], we cannot rule out the possibility that it could be involved in splicing of other genes.
Other human hnRNPs related to splicing also have homologs in Arabidopsis. BLAST searches of the human (CUG)n triplet repeat RNA-binding protein (CUG-BP) against all Arabidopsis proteins revealed three putative homologs, including atFCA. atFCA and CUG-BP share similarity within the RRMs and a region approximately 40 amino acids in length. An additional protein (At2g47310) related to FCA was identified and named FCA2, as it shares about 50% similarity with FCA. The FCA proteins have two RRMs and a WW domain, which interact with the FY protein, a homolog of yeast polyadenylation factor Pfs2p [80,81]. The FCA-FY complex negatively regulates the FCA protein by favoring a polyadenylation site from the third intron of FCA pre-mRNA [80,82]. FCA may be a multifunctional protein involved in mRNA processing, as human CUG-BP can function in both alternative splicing and deadenylation [83]. We also list 15 previously identified hnRNP-like proteins and two additional homologs as possible splicing regulators. The UBP1 proteins can strongly enhance splicing of some introns in protoplasts [84], whereas UBA1, RBP45 and RBP47 proteins have no similar function [78,85].

Unclassified splicing protein candidates
In addition to the 260 proteins in the above three categories, there are also 84 Arabidopsis proteins corresponding to human spliceosome-associated proteins identified in recent proteomic studies [15][16][17][18]. Some of these proteins function in other processes, such as transcription, polyadenylation and even translation. Their association with spliceosomes provides evidence for the coupling of splicing and other processes. Other proteins have no known functions. Only 35.8% of the proteins in this category are duplicated in Arabidopsis. We also identified a total of 51 Arabidopsis protein-coding genes similar to known splicing proteins. They have conserved domains and some level of sequence similarity to known splicing factors. We did not include these two categories in Table 2, but detailed information about them is available at ASRG [39].

Distribution and duplication of Arabidopsis splicing-related genes
The distribution of Arabidopsis snRNA and splicing-related proteins across the genome is shown in Figure 2 and at the ASRG website. Overall, the genes appear evenly distributed on the chromosomes, with several small gene clusters. Only four snRNA genes are located on chromosome 2, three of which are U2 snRNA genes. No U4 snRNA gene is located on chromosome 4. For the protein-coding genes, most functional categories have members located on each chromosome. The only exception is the SR protein kinase family, which has no member on chromosome 1. Interestingly, chromosome 1 contains the most snRNP proteins and splicing factors, but has the fewest splicing regulators. Several gene clusters encoding splicing-related proteins were also identified. Some clusters, such as tandemly duplicated gene pairs, include genes from the same category. One cluster located on chromosome 4 includes four genes encoding tri-snRNP proteins (atTri65a, atTri65b, atTri65c and atTri15.5c, homologs of tri-snRNP 65-KD protein and 15.5 KD protein). Two other clusters, atU2A-atCdc5 and atCUG-BP1-atU1C, include genes from different functional categories. No clear clusters of genes for snRNA-splicing-related proteins were identified. Although about one third of snRNA genes are located near other protein-coding genes, none of their neighboring genes is related to splicing. As a caveat, we should point out that our snRNA gene determination strongly suggests annotation errors in overlapping protein-coding gene models. Thus, atU2-1, atU2.3, atU4.2, atU4-11p, atU5-13 and atU6.26 overlap gene models At1g16820, At3g57770, At3g06895, At1g68390, At5g53740 and At3g13857, respectively, but none of these models is well supported by cDNA or EST evidence (see displays linked at ASRG [30]).
The 260 proteins in the first three categories could be grouped into 130 families, 66 of which consist of multiple members. The duplication rate is over 50%, which is higher than the 44% duplication rate of Arabidopsis transcription factors [86]. As shown in Table 3, about 50% of genes encoding snRNP proteins, 43% of splicing factors, and 78% of splicing regulators have duplications. The much higher duplication rate of splicing regulators may reflect diversification in splicing control.
At least 130 duplication events are required to yield the 260 proteins from 130 families given one single-copy ancestor per family. Thirty-three duplication events (about a quarter of the total) are likely to be the result of chromosome duplications. The chromosomal duplication ratio is 18.9-27.5% among the three groups (see Table 3). Some snRNA gene pairs, such as U2-14/U2-10, U5-3/U5-5 and (U6.1 U6.26)/(U6-8p U6-9p), may also have been produced by chromosome duplication. The C.D.2-3 region (chromosome duplication region between chromosomes 2 and 3, see [87]) has the most splicing-related gene pairs. Six genes in this region on chromosome 2 were duplicated in the same order on chromosome 3. EST evidence shows that all these genes are expressed. Three U5 snRNA genes (U5.1, U5.1b and U5-4) and four U2 snRNA genes (U2.2, U2.3, U2.4 and U2.6) also are located in the same region on chromosome 3. No U5 and U2 homologs exist in the corresponding region on chromosome 2, suggesting that the snRNA duplication events in that region may have happened after the chromosome duplication event, or that the snRNA duplicates were lost subsequent to chromosome duplication.
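The duplication arithmetic above follows from a simple counting argument: if each family descends from one single-copy ancestor, every gene beyond the first in a family requires at least one duplication. The short Python sketch below reproduces the quoted figures from the counts given in the text; the function name is hypothetical, and reading the "over 50%" duplication rate as the share of multi-member families (66 of 130) is our assumption rather than a definition stated in the text.

```python
def min_duplication_events(n_genes: int, n_families: int) -> int:
    """Minimum duplications needed to grow n_families single-copy ancestors into n_genes."""
    return n_genes - n_families


# Counts quoted in the text for the first three functional categories.
n_genes = 260
n_families = 130
multi_member_families = 66
chromosomal_events = 33

events = min_duplication_events(n_genes, n_families)

# Share of the minimum duplication events attributed to chromosome duplications.
chromosomal_share_pct = round(100 * chromosomal_events / events, 1)

# Assumed reading of the "over 50%" figure: families with more than one member.
multi_member_pct = round(100 * multi_member_families / n_families, 1)

print(events, chromosomal_share_pct, multi_member_pct)
```

The chromosomal share comes out at roughly a quarter of the minimum number of events, in line with the statement in the text.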
Chromosomal duplication rather than individual gene duplication appears to be the predominant mode of amplification for some types of genes. As shown in Table 2

Alternative splicing of Arabidopsis splicing-related genes
According to EST/cDNA alignments, 80 of the 260 protein-coding genes show 66 alternative splicing events. This rate (30.8%) is much higher than the overall frequency of alternative splicing in Arabidopsis, which is about 13% using the same criteria (2,747 genes out of 20,446 genes with EST/cDNA evidence; B.-B.W. and V.B., unpublished work). As shown in Table 4, the snRNP protein-coding genes have the lowest alternative splicing ratio (24.2%), whereas the ratios for splicing factor and splicing regulator genes are both over 33%. More than half of the genes encoding EJC proteins, proteins specific for the B∆U1 complex, SR proteins, U11/U12 snRNP-specific proteins and U1 snRNP proteins undergo alternative splicing.
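The two frequencies compared above follow directly from the gene counts given in the text. The short sketch below recomputes them and their ratio; the helper name is hypothetical, and no significance test is implied.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the text.
splicing_related_rate = percent(80, 260)    # splicing-related genes
genome_wide_rate = percent(2747, 20446)     # all genes with EST/cDNA evidence

# How many times more frequent alternative splicing is in the
# splicing-related set than in the genome-wide set.
fold_difference = splicing_related_rate / genome_wide_rate

print(f"{splicing_related_rate:.1f}% vs {genome_wide_rate:.1f}% "
      f"({fold_difference:.1f}-fold)")
```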
Among different types of alternative splicing, intron retention is the most abundant of the alternative transcripts identified for the 260 classified splicing-related genes. As shown in  ping is an important post-transcriptional method for controlling the expression of splicing factor coding genes but not the splicing regulator genes.

Discussion
Previous studies had determined 30 snRNA genes and 46 protein-coding genes related to splicing in Arabidopsis (see Tables 1 and 2). In this study, we have computationally identified an additional 44 snRNA genes (Table 1) and 349 protein-coding genes (Table 2) that also may be involved in splicing. Among the five types of U snRNAs, U6 is the most conserved and U1 is the least conserved. We identified seven U1-U4 snRNA gene clusters. We were surprised to see so many U1-U4 clusters in Arabidopsis. In Drosophila, four snRNA clusters were reported [4], but none of them includes U1-U4 gene pairs. It is likely that a U1-U4 snRNA cluster existed in a progenitor of the current Arabidopsis genome, which was duplicated several times to form the extant seven clusters. The non-clustered U1 and U4 snRNA genes may have arisen by individual gene duplication or gene loss in duplicated clusters.
Among the proteins involved in splicing, most animal homologs are conserved in plants, indicating an ancient, monophyletic origin for the splicing mechanism. A striking feature of plant splicing-related genes is their duplication ratio. Fifty percent of the splicing genes are duplicated in Arabidopsis. The duplication ratio of the splicing-related genes increases from genes encoding snRNP proteins to genes encoding splicing regulators. These data strongly suggest that the general splicing mechanism is conserved, but that the control of splicing may be more diverse in plants.
The high duplication ratio of Arabidopsis splicing-related genes could be the result of evolutionary selection. Unlike animals, which can move around to maintain more homogeneous physiological conditions, plants are exposed to a larger range of stress conditions such as heat and cold. The duplicates will more probably be maintained in the genome as their functions become diversified, and potentially plant-specific, to ensure the fidelity of splicing under such varied conditions. Chromosome duplication has produced several Sm proteins, SR proteins and hnRNP proteins in Arabidopsis, which in turn could create positive selective pressures influencing the rate of duplication for functionally related genes. Because chromosome duplication occurred differentially within each plant lineage, we would expect different duplication patterns of these genes in, for example, monocots and dicots.
To confirm the above hypothesis, we searched the recently sequenced rice genome using the five Arabidopsis SC35 and SC35-like proteins as probes. Eight distinct genome loci were found to encode SC35 and SC35-like proteins, including three homologs of atSC35, two homologs of atSR33/SCL33 and atSCL30a, two homologs of atSCL30, and one homolog of atSCL28. Five of the eight rice genes are currently annotated in GenBank with accession numbers BAC79909 (osSC35a), BAD09319 (osSC35b), AAP46199 (osSR33-1), BAC799901 (osSCL30a/osSR33-2), and BAD19168 (osSCL30-1). As shown in the phylogenetic tree displayed in Figure 3, the two rice SC35 genes and atSC35 are likely to be orthologs of the animal SC35 gene. The other sequences cluster in SC35-like (SRrp/TASR) clades, indicating that the SC35 and SRrp/TASR genes diverged before the divergence of monocot and dicot plants (the divergence presumably happened even before the divergence of animals and plants, as described earlier). In addition, there are species-specific duplications. Thus, the Arabidopsis chromosomal duplication pair atSR33-atSCL30a forms a clade, while their rice copies (osSR33-1 and osSCL30a) form another clade. Also there are additional duplications for the rice SC35 and SCL30 genes. We are currently working to identify all rice splicing-related genes. The complete sets of these genes in two plant species should provide a good foundation for assessing similarities and differences in splicing mechanisms used by monocot and dicot plants.

The column entries are the numbers of genes in which the respective alternative splicing events can occur. AltA, alternative acceptor site; AltD, alternative donor site; AltP, alternative intron position (both acceptor and donor sites are different); ExonS, exon skipping; IntronR, intron retention. The Overall and Ratio columns give the number and fraction of genes with any type of alternative splicing, respectively.

Figure 3. Phylogenetic tree of the SC35 protein family. The phylogenetic tree was constructed on the basis of protein sequence alignments of the SC35 homologs in human, Drosophila, Arabidopsis and rice. The GenBank accession numbers for the sequences are as follows: hsSC35, Q01130; hsSRrp40, AAL57514; hsSRrp35, AAL57515; dmSC35, AAF53192; atSC35, NP_851261; atSR33/SCL33, NP_564685; atSCL30a, NP_187966; atSCL30, NP_567021; atSCL28, NP_197382; osSC35a, BAC79909; osSC35b, BAD09319; osSR33-1, AAP46199; osSCL30a/SR33-2, BAC799901; osSCL30-2, BAD19168. The sequences were aligned using CLUSTALW [22] with default parameters, and the phylogenetic tree was produced according to the neighbor-joining method using PAM substitution model distances as implemented in the PHYLIP package [103].
As introns evolve rapidly, the mechanism to recognize and splice them should either evolve correspondingly or be flexible enough to accommodate the changes. Plants appear to have adopted the most economical and practical solution: keeping a largely conserved splicing mechanism together with a very flexible recognition and control mechanism. Direct evidence comes from the presence of plant-specific splicing proteins, such as the novel SR protein family and the superfamily of hnRNP A/B. The absence of the SMN complex and of some yeast U1 snRNP proteins in Arabidopsis indicates that other organisms have likewise integrated new proteins or pathways into the splicing mechanism over the course of evolution. Other evidence supporting a conserved splicing mechanism with flexible regulation includes the differential conservation among U snRNAs (U1 snRNAs are less conserved than U6 snRNAs) and the high alternative splicing frequency in genes encoding U1 snRNP proteins, SR proteins and hnRNP proteins. The SR proteins and U1 snRNP proteins are involved in early steps of splicing and in 5' and 3' splice-site selection; multiple isoforms of these proteins may be functionally significant in the control of splicing.
It is interesting to note that the overall alternative splicing frequency in splicing-related genes is much higher than the frequency averaged over all Arabidopsis genes. More than half of the genes encoding SR proteins and U1 snRNP proteins show alternative splicing. Alternative splicing might increase the diversity of proteins derived from splicing-related genes, which would further add flexibility to the splicing mechanism. The high frequency of alternative transcripts from splicing-related genes raises another interesting question: how is splicing regulated in these splicing-related genes? One possible answer is that some splicing-related genes may be autoregulated. Accumulation of one transcript would feed back to inhibit or promote other isoforms. Several splicing-related genes have been reported to be regulated in this way. For example, AtGRP7 (hnRNP A/B superfamily) is a circadian clock-regulated protein which negatively autoregulates its expression [79]. When the AtGRP7 protein accumulates over the circadian cycle, it promotes production of alternative transcripts which use a cryptic 5' splice site. The alternative transcripts contain premature stop codons and, as a result of message instability, do not accumulate to high levels, thus decreasing the level of AtGRP7 protein [79]. atSRp30 has similar effects on its own transcripts [65]. Another possible answer is that some splicing-related genes might regulate the splicing of other splicing-related genes. For example, overexpression of AtGRP7 and atSRp30 is known to affect the splicing of AtGRP8 and atSR1, respectively [65,79]. A third possibility is that the environment could affect the alternative splicing pattern. A good example is the SR1 gene. The ratio of two transcripts from the SR1 gene (SR1B/SR1) increases in a temperature-dependent manner [67]. Generally, heat or cold stress could cause intron retention in some splicing regulators, which could further alter the splicing pattern of other genes. A fourth class of possible regulators is intronless genes. Combining all these possibilities, a pathway to regulate splicing could be inferred as follows: environmental changes → splicing pattern changes in some specific splicing-related genes and/or intronless genes → expression pattern changes (including splicing pattern changes) in general splicing-related genes → changes in splicing patterns for specific genes.

Conclusions
A large number of Arabidopsis splicing-related genes were computationally identified in this study by means of sequence comparisons and motif searches, including a tentative U4atac snRNA gene containing all conserved motifs, a new SR protein-coding gene (atRSp32) belonging to the atRSp31 family, and several genes related to genes encoding known splicing-related proteins (atULrp and atFCA2). A web-accessible database containing all the Arabidopsis splicing-related genes has been constructed and will be expanded to other organisms in the near future. This compilation should provide a good foundation to study the splicing process in more detail and to determine to what extent these genes are conserved across the entire plant kingdom. Our data show that about 50% of the splicing-related genes are duplicated in Arabidopsis. The duplication ratios for splicing regulators are even higher, indicating that the splicing mechanism is generally conserved among plants, but that the regulation of splicing may be more variable and flexible, thus enabling plants to respond to their specific environments.

Search for Arabidopsis snRNAs
Sequences of the 15 experimentally identified major snRNAs were downloaded from GenBank. The two minor snRNA sequences were compiled from the literature [28]. These sequences were used to search against the Arabidopsis genome at the AtGDB BLAST server [88] and at the SALK T-DNA Express web server [89]. Our initial analysis was based on Release 3.0 of the Arabidopsis genome (GenBank accession numbers NC_003070.4, NC_003071.3, NC_003074.4, NC_003075.3, and NC_003076.4). Local BLAST [21] was used to derive the locations of the snRNA homologs from more recently sequenced regions of the genome. The parameters used for local BLAST were '-e 1 -F F -W 7' (e-value cutoff of 1, DUST filter off, minimum word size of 7). Human and maize snRNAs were also included as query sequences, and all hits with e-values less than 10⁻⁵ were regarded as possible homologs. A total of 70 major snRNAs and three minor snRNAs were identified by this method. Each major snRNA type has 10-18 copies in the genome. A tentative gene name and gene model were assigned to each snRNA gene after comparison with the snRNAs identified in MATDB [90]. Sequence-similarity values were based on BLAST alignments.
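The e-value filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in this study: it assumes the BLAST hits were written in the standard 12-column tab-separated format, the function and class names are hypothetical, and the two example rows are invented purely to exercise the code.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    s_start: int
    s_end: int
    evalue: float


def parse_tabular_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tabular BLAST output: query, subject, %identity,
    alignment length, mismatches, gap opens, q.start, q.end, s.start,
    s.end, e-value, bit score. Blank and comment lines are skipped."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], int(fields[8]),
                        int(fields[9]), float(fields[10])))
    return hits


def candidate_homologs(hits: Iterable[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits whose e-value is below the cutoff quoted in the text."""
    return [h for h in hits if h.evalue < max_evalue]


# Invented example rows (not real search results).
example = [
    "\t".join(["U1_query", "chr1", "98.2", "160", "3", "0",
               "1", "160", "5000", "5159", "2e-70", "270"]),
    "\t".join(["U1_query", "chr3", "85.0", "40", "6", "0",
               "10", "49", "900", "939", "0.5", "30.1"]),
]
hits = parse_tabular_hits(example)
kept = candidate_homologs(hits)
print([(h.subject, h.s_start, h.s_end) for h in kept])
```

The subject start and end coordinates retained for each hit are what would be used to place a candidate snRNA gene on its chromosome.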

Search for Arabidopsis splicing-related proteins
A three-round BLAST search strategy was used to identify Arabidopsis splicing-related protein-coding genes. First, sequences of splicing-related proteins from human and Drosophila were downloaded from GenBank according to several recent proteomic studies [15-18] and the website compilation of Stephen Mount's group available at [91]. Human hnRNP proteins identified in a recent review [76] were downloaded from GenBank. All these sequences were used as queries in a local BLAST search against Arabidopsis annotated proteins (obtained from TIGR at [92]). All hits with an e-value less than 10⁻¹⁰ were collected as candidates. Many of these candidates had highly significant e-values (usually 10⁻³⁰ or below and much lower than other hits). These candidates were regarded as true homologs.
In the second step, all identified true homologs were used to query the Arabidopsis protein set again. An e-value of 10⁻²⁰ was used as a cutoff value to find possible paralogs of the true homologs. Sequences identified in both rounds of BLAST hits were regarded as main candidates for splicing-related proteins.
Finally, the main candidates were queried against GenPept and all annotated human proteins (obtained from Ensembl [93]). All candidates with significant similarity to proteins unrelated to splicing were removed from the main candidate list, and all candidates with significant similarity to proteins related to splicing were regarded as true splicing-related genes and were promoted to the status of true homologs. The remaining candidates were regarded as unclassified splicing-related proteins. BLAST results were initially analyzed by MuSeqBox [94]. Two custom scripts were written to read MuSeqBox output files, largely automating the search procedure.
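The decision logic of the three rounds can be summarized in a short sketch. It is a simplified reading of the procedure rather than the custom scripts themselves, and it makes several assumptions that should be kept in mind: the 10⁻³⁰ value is treated as a hard threshold although the text describes it only as typical, the main candidates are taken to be the union of the hits from the first two rounds, the third round is applied only to candidates not already accepted as true homologs, and all function names, labels and gene identifiers are hypothetical.

```python
from typing import Dict, Optional


def classify_candidates(
    round1_evalue: Dict[str, float],             # best e-value vs. human/Drosophila queries
    round2_evalue: Dict[str, float],             # best e-value vs. round-1 true homologs
    external_match: Dict[str, Optional[str]],    # "splicing", "other" or missing
    candidate_cutoff: float = 1e-10,
    homolog_cutoff: float = 1e-30,               # assumption: used here as a fixed threshold
    paralog_cutoff: float = 1e-20,
) -> Dict[str, str]:
    """Assign each main candidate a status following the three rounds."""
    candidates = {g for g, e in round1_evalue.items() if e < candidate_cutoff}
    true_homologs = {g for g in candidates if round1_evalue[g] <= homolog_cutoff}
    paralogs = {g for g, e in round2_evalue.items() if e < paralog_cutoff}

    status = {}
    for gene in sorted(candidates | paralogs):   # assumption: union of both rounds
        if gene in true_homologs:
            status[gene] = "true homolog"
        elif external_match.get(gene) == "splicing":
            status[gene] = "true homolog"        # promoted in round 3
        elif external_match.get(gene) == "other":
            status[gene] = "removed"
        else:
            status[gene] = "unclassified"
    return status


# Invented example values.
round1 = {"geneA": 1e-45, "geneB": 1e-12, "geneC": 1e-3}
round2 = {"geneC": 1e-25, "geneD": 1e-22, "geneE": 1e-5}
external = {"geneB": "other", "geneC": "splicing"}
status = classify_candidates(round1, round2, external)
print(status)
```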

Gene structure and chromosomal locations
The gene structure and chromosomal locations for the genes encoding splicing-related proteins were retrieved from AtGDB [95]. The chromosomal locations of the snRNA genes were inferred from the BLAST results. The location maps (Figure 1) were generated using the AtGDB advanced search function [96]. Spliced alignments of ESTs and cDNAs generated by GeneSeqer [97] were used to verify gene models. Gene structure information was used as an important criterion to group homologs into gene families.

Protein domains
InterProScan 3.3 was downloaded from [98] and was subsequently used to search protein domain databases using default parameters [99]. A Perl script was written to process the text results from InterProScan. Protein domain information was used in comparisons of homologs from different species. The search of the National Center for Biotechnology Information Conserved Domain Database (NCBI-CDD) [100] was conducted manually for certain genes to confirm the InterPro results.

Duplication source
The gene families with multiple copies were inspected to determine whether they were likely to have derived from chromosome-duplication events. Gene models of the duplicated gene were searched against the gene list of each chromosome redundancy region at MATDB [101]. If the gene and its duplicate were both in the list, they were regarded as a chromosome duplication pair. Otherwise, they were assumed to be produced by random gene duplication.
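The lookup described above amounts to a membership test against per-region gene lists. The sketch below assumes that each chromosome redundancy region is available as a set of gene identifiers; the region names, gene identifiers and function names are hypothetical and do not reflect the actual MATDB data format.

```python
from typing import Dict, Optional, Set


def shared_duplicated_region(
    gene_a: str, gene_b: str, region_genes: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the name of a duplicated region whose gene list contains
    both genes, or None if no such region exists."""
    for region, genes in region_genes.items():
        if gene_a in genes and gene_b in genes:
            return region
    return None


def duplication_source(
    gene_a: str, gene_b: str, region_genes: Dict[str, Set[str]]
) -> str:
    """Label a duplicated gene pair according to the rule in the text."""
    if shared_duplicated_region(gene_a, gene_b, region_genes) is not None:
        return "chromosome duplication"
    return "individual gene duplication"


# Invented example data.
regions = {
    "region_X": {"gene1", "gene2", "gene5"},
    "region_Y": {"gene3", "gene4"},
}
print(duplication_source("gene1", "gene2", regions))
print(duplication_source("gene1", "gene3", regions))
```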

Identification of alternative splicing
All Arabidopsis ESTs and cDNAs were aligned against the genome using the spliced alignment program GeneSeqer as made available through AtGDB [102]. We retrieved the intron and exon coordinates of the reliable cognate alignments from the database. Scripts were written to identify introns that overlap with other introns or exons. We defined the alternative splicing cases as follows: alternative donor (AltD): an intron has the same 3'-end coordinate as, but a different 5'-end coordinate from, another overlapping intron; alternative acceptor (AltA): an intron has the same 5'-end coordinate as, but a different 3'-end coordinate from, another overlapping intron; alternative position (AltP): an intron differs in both its 5'-end and 3'-end coordinates from another overlapping intron; exon skipping (ExonS): an annotated intron completely contains an alternatively identified exon in the same transcription direction; intron retention (IntronR): an annotated intron is completely contained by an alternatively identified exon.
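The five definitions above translate directly into coordinate comparisons. The following sketch is one possible implementation, not the scripts used in this study. It assumes that each intron or exon is given by its leftmost and rightmost genomic coordinates plus a strand, so that the 5' end of an intron is its leftmost coordinate on the plus strand and its rightmost coordinate on the minus strand. It also interprets "completely contains" as strict containment. All names are hypothetical.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    start: int    # leftmost genomic coordinate (inclusive)
    end: int      # rightmost genomic coordinate (inclusive)
    strand: str   # "+" or "-"


def overlaps(a: Feature, b: Feature) -> bool:
    return a.start <= b.end and b.start <= a.end


def five_prime(intron: Feature) -> int:
    """Donor-site coordinate (5' end of the intron)."""
    return intron.start if intron.strand == "+" else intron.end


def three_prime(intron: Feature) -> int:
    """Acceptor-site coordinate (3' end of the intron)."""
    return intron.end if intron.strand == "+" else intron.start


def compare_introns(a: Feature, b: Feature) -> Optional[str]:
    """Classify two introns as AltD, AltA or AltP, or return None."""
    if a.strand != b.strand or not overlaps(a, b):
        return None
    same_5 = five_prime(a) == five_prime(b)
    same_3 = three_prime(a) == three_prime(b)
    if same_5 and same_3:
        return None            # identical introns
    if same_3:
        return "AltD"          # shared acceptor, different donor
    if same_5:
        return "AltA"          # shared donor, different acceptor
    return "AltP"              # both ends differ


def compare_intron_exon(intron: Feature, exon: Feature) -> Optional[str]:
    """Classify an annotated intron against an alternatively identified exon."""
    if intron.strand != exon.strand:
        return None
    if intron.start < exon.start and exon.end < intron.end:
        return "ExonS"         # exon lies strictly inside the intron
    if exon.start < intron.start and intron.end < exon.end:
        return "IntronR"       # intron lies strictly inside the exon
    return None


print(compare_introns(Feature(100, 200, "+"), Feature(120, 200, "+")))
print(compare_intron_exon(Feature(100, 300, "+"), Feature(150, 200, "+")))
```

Handling strand explicitly matters because the same pair of genomic coordinates is an alternative donor on one strand and an alternative acceptor on the other.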