The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing
Genome Biology volume 5, Article number: R102 (2004)
A total of 74 small nuclear RNA (snRNA) genes and 395 genes encoding splicing-related proteins were identified in the Arabidopsis genome by sequence comparison and motif searches, including the previously elusive U4atac snRNA gene. Most of the genes have not been studied experimentally. Classification of these genes and detailed information on gene structure, alternative splicing, gene duplications and phylogenetic relationships are made accessible as a comprehensive database of Arabidopsis Splicing Related Genes (ASRG) on our website.
Most eukaryotic genes contain introns that are spliced from the precursor mRNA (pre-mRNA). The correct interpretation of splicing signals is essential to generate authentic mature mRNAs that yield correct translation products. Splicing is also an important post-transcriptional control point: gene function can be controlled at the level of splicing through the production of different mRNAs from a single pre-mRNA (reviewed in ). The general mechanism of splicing has been well studied in human and yeast systems and is largely conserved between these organisms. Plant RNA splicing mechanisms remain comparatively poorly understood, due in part to the lack of an in vitro plant splicing system. Although the splicing mechanisms in plants and animals appear to be similar overall, incorrect splicing of plant pre-mRNAs in mammalian systems (and vice versa) suggests that there are plant-specific characteristics, resulting from coevolution of splicing factors with the signals they recognize or from the requirement for additional splicing factors (reviewed in [2, 3]).
Genome projects are accelerating research on splicing. For example, with the majority of splicing-related genes already known in human and budding yeast, these gene sequences were used to query the Drosophila and fission yeast genomes in an effort to identify potential homologs [4, 5]. Most of the known genes were found to have homologs in both Drosophila and fission yeast. The availability of the near-complete genome of Arabidopsis thaliana  provides the foundation for the simultaneous study of all the genes involved in particular plant structures or physiological processes. For example, Barakat et al.  identified and mapped 249 genes encoding ribosomal proteins and analyzed gene number, chromosomal location, evolutionary history (including large-scale chromosomal duplications) and expression of those genes. Beisson et al.  catalogued all genes involved in acyl lipid metabolism. Wang et al.  surveyed more than 1,000 Arabidopsis protein kinases and computationally compared derived protein clusters with established gene families in budding yeast. Previous surveys of Arabidopsis gene families that contain some splicing-related genes include the DEAD box RNA helicase family  and RNA-recognition motif (RRM)-containing proteins . At present, the Arabidopsis Information Resource (TAIR) links to more than 850 such expert-maintained collections of gene families .
Here we present the results of computational identification of potentially all or nearly all Arabidopsis genes involved in pre-mRNA splicing. Recent mass spectrometry analyses revealed more than 200 proteins associated with human spliceosomes ([13–17], reviewed in ). By extensive sequence comparisons using known plant and animal splicing-related proteins as queries, we have identified 74 small nuclear RNA (snRNA) genes and 395 protein-coding genes in the Arabidopsis genome that are likely to be homologs of animal splicing-related genes. About half of the genes occur in multiple copies in the genome and appear to have been derived both from chromosomal duplication events and from duplication of individual genes. All genes were classified into gene families, named and annotated with respect to their inferred gene structure, predicted protein domain structure and presumed function. The classification and analysis results are available as an integrated web resource, the database of Arabidopsis Splicing Related Genes (ASRG), which should facilitate genome-wide studies of pre-mRNA splicing in plants.
ASRG: a database of Arabidopsis splicing-related genes
Our up-to-date web-accessible database comprising the Arabidopsis splicing-related genes and associated information is available at . The web pages display gene structure, alternative splicing patterns, protein domain structure and potential gene duplication origins in tabular format. Chromosomal locations and spliced alignment of cognate cDNAs and expressed sequence tags (ESTs) are viewable via links to the Arabidopsis genome database AtGDB , which also provides other associated information for these genes and links to other databases. Text-search functions are accessible from all the web pages. Sequence-analysis tools including BLAST  and CLUSTAL W  are integrated and facilitate comparison of splicing-related genes and proteins across various species.
A total of 15 major snRNA and two minor snRNA genes were previously identified experimentally in Arabidopsis [23–28]. These genes were used as queries to search the Arabidopsis genome for other snRNA genes. A total of 70 major snRNAs and three minor snRNAs were identified by this method. In addition, a single U4atac snRNA gene was identified by sequence motif search. We assigned tentative gene names and gene models as shown in Table 1, together with chromosome locations and similarity scores relative to a representative query sequence. The original names for known snRNAs were preserved, following the convention atUx.y, where x indicates the U snRNA type and y the gene number. Computationally identified snRNAs were named similarly, but with a hyphen instead of a period separating type from gene number (atUx-y). Putative pseudogenes were indicated with a 'p' following the gene name. Pseudogene status was assigned to gene models for which sequence similarity to known genes was low, otherwise-conserved transcription signals were missing, and the gene could not fold into the typical secondary structure. A recent experimental study of small non-messenger RNAs identified 14 tentative snRNAs in Arabidopsis by cDNA cloning (, GenBank accessions 22293580 to 22293592 and 22293600, Table 1). All these newly identified snRNAs were found in the set of our computationally predicted genes.
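The naming convention described above can be made concrete with a small parser. The following is only an illustrative sketch: the function name, the record fields and the regular expression are our own assumptions, not part of the ASRG database, and names that depart from the atUx.y / atUx-y pattern (for example U1a) are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional


class SnRNAName(NamedTuple):
    snrna_type: str    # e.g. "U2" or "U6atac"
    gene_number: int   # the y in atUx.y / atUx-y
    origin: str        # "known" (period) or "computational" (hyphen)
    pseudogene: bool   # True when the name ends in 'p'


# Hypothetical pattern: optional "at" prefix, U type, separator, number, optional 'p'.
_NAME_RE = re.compile(r"^(?:at)?(U\d+(?:atac)?)([.-])(\d+)(p?)$")


def parse_snrna_name(name: str) -> Optional[SnRNAName]:
    """Split a gene name into its parts, or return None if it does not fit."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    snrna_type, separator, number, suffix = match.groups()
    origin = "known" if separator == "." else "computational"
    return SnRNAName(snrna_type, int(number), origin, suffix == "p")


if __name__ == "__main__":
    for example in ("atU2.3", "atU4-11p", "atU6atac-2", "U1a"):
        print(example, parse_snrna_name(example))
```

Under this sketch, atU2.3 is read as a previously named U2 gene, whereas atU4-11p is read as a computationally identified U4 gene flagged as a putative pseudogene.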
Conservation of major snRNA genes
As shown in Table 1, each of five major snRNA genes (U1, U2, U4, U5 and U6) exists in more than 10 copies in the Arabidopsis genome. U2 snRNA has the largest copy number, with a total of 18 putative homologs identified. Both U1 and U5 snRNAs have 14 copies, U6 snRNA has 13 copies, and U4 snRNA has only 11 copies. Sequence comparisons within Arabidopsis snRNA gene families showed that the U6 snRNA genes are the most similar, and the U1 snRNA genes are the most divergent. Eight active U6 snRNA copies are more than 93% identical to each other in the genic region, whereas active U1 snRNAs are on average only 87% identical. The U2 and U4 snRNAs are also highly conserved within each type, with more than 92% identity among the active genes. Details about the individual snRNAs and the respective sequence alignments are displayed at .
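For readers who wish to reproduce identity figures of this kind, a minimal sketch of a pairwise percent-identity calculation over two pre-aligned sequences is given below. The exact procedure used for the values quoted above is not stated here, so the treatment of gaps (a column with a gap in only one sequence counts as a mismatch, and columns with gaps in both are skipped) is an assumption, as is the function name.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows of an alignment ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue          # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1      # identical residues (neither can be a gap here)
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy sequences, not real snRNA genes.
    print(percent_identity("ACGTACGT", "ACGTACGA"))
```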
Previous studies identified two conserved transcription signals in most major snRNA gene promoters: the USE (upstream sequence element, RTCCACATCG, where R is either A or G) and the TATA box [24–27]. All 14 U5 snRNAs have the USE and TATA box. Furthermore, their predicted secondary structures are similar to the known structure of their counterparts in human, indicating that all these genes are active and functional (structure data not shown; for a review of the structures of human snRNAs, see ). Similarly, we identified 17 U2, 10 U1, nine U4, and nine U6 snRNA genes as likely active genes, with a few additional genes more likely to be pseudogenes because of various deletions. U4-10 and U6-7 do not have the conserved USE in the promoter region, but their U4-U6 interaction regions (stem I and stem II) are fairly well conserved. U2-16 is also missing the USE but has a secondary structure similar to other U2 snRNAs. These genes may be active, but differences in promoter motifs suggest that their expression may be under different control compared with other snRNA homologs. The U2-17 snRNA has all conserved transcription signals, but 20 nucleotides are missing from its 3' end. The predicted secondary structure of U2-17 is similar to that of other U2 snRNAs, with a significantly shorter stem-loop in the 3' end as a result of the deletion. We are not sure if the U2-17 snRNA is functional, but the conserved transcription signals imply that it may be active.
Other conserved transcription signals were also identified in most active snRNAs, including the sequence element CAANTC (where N is either A, C, G or T) in U2 snRNAs (located at -6 to -1) , and the termination signal CAN3-10AGTNNAA in U snRNAs (U1, U2, U4 and U5) transcribed by RNA polymerase II (Pol II) [23, 24, 32]. The previously identified monocot-specific promoter element (MSP, RGCCCR, located upstream of USE) in U6.1 and U6.26  is also found in five other U6 snRNA genes (U6.29, U6-2, U6-3, U6-4, U6-5). In all seven U6 snRNAs the consensus MSP sequence extends by two thymine nucleotides to RGCCCRTT. Although the MSP does not contribute significantly to U6 snRNA transcription initiation in Nicotiana plumbaginifolia protoplasts , the extended consensus may imply a role in gene expression regulation in Arabidopsis.
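The sequence elements quoted in this section are written with IUPAC ambiguity codes, and a scan for them can be sketched as follows. This is a hedged illustration rather than the search procedure actually used: the helper names are hypothetical, only the codes that occur in the text (R and N) are handled, and reading CAN3-10AGTNNAA as "CA, then 3 to 10 arbitrary nucleotides, then AGTNNAA" is our interpretation of the notation.

```python
import re
from typing import List

# Only the IUPAC codes that appear in the motifs quoted in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "N": "[ACGT]"}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular-expression fragment."""
    return "".join(IUPAC[base] for base in motif.upper())


USE = iupac_to_regex("RTCCACATCG")      # upstream sequence element
U2_ELEMENT = iupac_to_regex("CAANTC")   # element reported at -6 to -1 in U2 genes
MSP_EXTENDED = iupac_to_regex("RGCCCRTT")
# Assumed reading of CAN3-10AGTNNAA: CA + 3..10 of any base + AGTNNAA.
TERMINATION = "CA" + "[ACGT]{3,10}" + iupac_to_regex("AGTNNAA")


def find_motif(pattern: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_promoter = "TTGTCCACATCGAA"   # synthetic example, not a real promoter
    print(find_motif(USE, toy_promoter))
```

A lookahead is used so that overlapping occurrences are all reported; positions are offsets within the supplied string and must be converted to promoter coordinates by the caller.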
Low copy number of minor snRNA genes
The minor snRNAs are functional in the splicing of U12-type (AT-AC) introns. Four types of minor snRNAs, which correspond to four types of major snRNAs, exist in mammals. U11 is the analog of U1, U12 is the analog of U2, U4atac is the analog of U4, and U6atac is the analog of U6. The U5 snRNA seems to function in both the major and minor spliceosomes . Two minor snRNAs (atU12 and atU6atac) were experimentally identified in Arabidopsis . Both have the conserved USE and TATA box in the promoter region. We identified another U6atac gene (atU6atac-2) by sequence mapping. This gene has a USE and a TATA box in the promoter region. The atU6atac-2 gene is more than 90% similar to atU6atac in both its 5' and 3' ends, with a 10-nucleotide deletion in the central region. The putative U4atac-U6atac interaction region in atU6atac-2 is 100% conserved with the interaction region previously identified in atU6atac [28, 35].
U11 and U4atac have not been experimentally identified in Arabidopsis. BLAST searches using the human U11 and U4atac homologs as queries against the Arabidopsis genome failed to find any significant hits, indicating divergence of the minor snRNAs in plants and mammals. Using the strategy described below, we successfully identified a putative Arabidopsis U4atac gene. It is a single-copy gene containing all conserved functional domains. We also found a single candidate U11 snRNA gene (chromosome 5, from 17,492,101 to 17,492,600) that has the USE and TATA box in the promoter region. This gene also contains a putative binding site for Sm protein and a region that could pair with the 5' splice site of the U12-type intron.
Identification of an Arabidopsis U4atac snRNA gene
Like U4 snRNA and U6 snRNA, human U4atac and U6atac snRNAs interact with each other through base pairing . The same interaction is expected to exist between the Arabidopsis homologs. Therefore, we deduced the tentative AtU4atac stem II sequence (CCCGTCTCTGTCAGAGGAG) from AtU6atac snRNA and searched for matching sequences in the Arabidopsis genome. Hit regions together with flanking regions 500 base-pairs (bp) upstream and 500 bp downstream were retrieved and screened for transcription signals (USE and TATA box). One sequence was identified that contains both the USE and TATA box in appropriate positions, as shown in Figure 1.
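The search strategy just described (locate the deduced stem II sequence, retrieve 500 bp of flanking sequence on each side, then screen for promoter signals) can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline actually used: matching is exact, both strands are scanned, the TATA box is approximated by the hypothetical consensus TATAAA, and "appropriate positions" is reduced to the order USE, then TATA, then hit. All function and field names are our own.

```python
import re
from typing import List, NamedTuple

STEM_II = "CCCGTCTCTGTCAGAGGAG"            # tentative AtU4atac stem II (from the text)
USE_RE = re.compile("[AG]TCCACATCG")       # USE consensus RTCCACATCG
TATA_RE = re.compile("TATAAA")             # ASSUMPTION: simplified TATA consensus
FLANK = 500                                # bp retrieved on each side of a hit

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


class Hit(NamedTuple):
    start: int        # 0-based position of the match on the forward strand
    strand: str       # "+" or "-"
    region: str       # hit plus flanks, oriented 5'->3' for the hit strand
    hit_offset: int   # position of the hit within `region`


def candidate_regions(chrom_seq: str, query: str = STEM_II,
                      flank: int = FLANK) -> List[Hit]:
    """Find exact matches on both strands and return them with flanks."""
    chrom_seq = chrom_seq.upper()
    hits = []
    for strand, target in (("+", query), ("-", reverse_complement(query))):
        start = chrom_seq.find(target)
        while start != -1:
            lo = max(0, start - flank)
            hi = min(len(chrom_seq), start + len(target) + flank)
            region = chrom_seq[lo:hi]
            offset = start - lo
            if strand == "-":
                region = reverse_complement(region)
                offset = hi - (start + len(target))
            hits.append(Hit(start, strand, region, offset))
            start = chrom_seq.find(target, start + 1)
    return hits


def has_promoter_signals(hit: Hit) -> bool:
    """True if a USE and then a TATA box both lie upstream of the hit."""
    use = USE_RE.search(hit.region, 0, hit.hit_offset)
    if use is None:
        return False
    return TATA_RE.search(hit.region, use.end(), hit.hit_offset) is not None
```

In practice one would call candidate_regions on each chromosome sequence and keep only the hits for which has_promoter_signals returns True; spacing constraints between the USE, the TATA box and the transcription start would need to be added for a realistic screen.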
The tentative U4atac snRNA gene contains not only the stem II sequence, but also the stem I sequence that presumably base-pairs with U6atac snRNA stem I. Furthermore, a highly conserved Sm-protein-binding region exists at the 3' end. The predicted secondary structure is nearly identical to hsU4atac, with a relatively longer single-stranded region (data not shown). With the highly conserved transcriptional signals, functional domains and secondary structure, this candidate gene is likely to be a real U4atac snRNA homolog. We named it AtU4atac and assigned At4g16065 as its tentative gene model because it is located between gene models At4g16060 and At4g16070 on chromosome 4.
Tandem arrays of snRNA genes
Some snRNA genes exist as small groups on the Arabidopsis chromosomes . We identified 10 snRNA gene clusters: seven U1-U4 snRNA clusters, one U2-U5 snRNA cluster, and a tandem duplication for both U2 snRNA (U2-10) and U5 snRNA (U5.1) (Figure 2). All seven Arabidopsis U1-U4 clusters have the U1 snRNA gene located upstream of the U4 snRNA gene, with a 180-300-nucleotide intergenic region. Five of the U1-U4 arrays are located on chromosome 5 (U1a/U4.1, U1-4/U4-5, U1-8/U4-7, U1-9/U4-8, and U1-13p/U4.3p), and the remaining two on chromosome 1 (U1-10/U4-6 and U1-14p/U4-10). U2-17 and U5-10 occur in a tandem array on chromosome 5, separated by fewer than 200 nucleotides.
Arabidopsis splicing-related protein-coding genes
Most of the proteins involved in splicing in mammals and Drosophila are known [4, 37, 38]. In addition, recent proteomics studies revealed many novel proteins associated with human spliceosomes (reviewed in ). Using all these animal proteins as query sequences, we identified a total of 395 tentative homologs in Arabidopsis. Sequence-similarity scores and comparison of gene structure and protein domain structure were used to assign the genes to families. Each gene was assigned a tentative name based on the name of its respective animal homolog. Different homologs within a gene family were labeled by adding an Arabic number (1, 2, and so on) to the name. Close family members with similar gene structure were indicated by adding -a, -b, and -c to the name. The 395 genes were classified into five different categories according to the presumed function of their products. Ninety-one encode small nuclear ribonucleoprotein particle (snRNP) proteins, 109 encode splicing factors, and 60 encode potential splicing regulators. Details of EST evidence, alternative splicing patterns, duplication sources and domain structure of these genes are listed in Table 2. We also identified 84 Arabidopsis proteins corresponding to 54 human spliceosome-associated proteins. The remaining 51 genes encode proteins with domains or sequences similar to known splicing factors, but without enough similarity to allow unambiguous classification. These two categories are not discussed in detail here, but information about these genes is available at our ASRG site .
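As a quick consistency check on the five categories, the counts given above can be tallied. The snippet is only a bookkeeping sketch; the dictionary keys are our own shorthand for the categories named in the text.

```python
# Gene counts per functional category, as stated in the text.
CATEGORY_COUNTS = {
    "snRNP proteins": 91,
    "splicing factors": 109,
    "splicing regulators": 60,
    "spliceosome-associated protein homologs": 84,
    "unclassified candidates": 51,
}

# The first three categories are the ones detailed in Table 2.
TABLE2_CATEGORIES = ("snRNP proteins", "splicing factors", "splicing regulators")

table2_total = sum(CATEGORY_COUNTS[name] for name in TABLE2_CATEGORIES)
grand_total = sum(CATEGORY_COUNTS.values())

if __name__ == "__main__":
    print("first three categories:", table2_total)
    print("all five categories:", grand_total)
```

The first three categories sum to 260 genes and all five to 395, in agreement with the totals quoted in the text.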
The majority of snRNP proteins are conserved in Arabidopsis
There are five snRNPs (U1, U2, U4, U5 and U6) involved in the formation of the major spliceosome, corresponding to five snRNAs. Five snRNPs (U1 snRNP, U2 snRNP, U5 snRNP, U4/U6 snRNP and U4.U6/U5 tri-snRNP) have been isolated experimentally in yeast or human [40–45]. Each snRNP contains the snRNA, a group of core proteins, and some snRNP-specific proteins. Most of these proteins are conserved in Arabidopsis. All U snRNPs except U6 snRNP contain seven common core proteins bound to snRNAs. These core proteins all have an Sm domain and have been called Sm proteins. The U6 snRNP contains seven LSM proteins ('like Sm' proteins). Another LSM protein (LSM1) is not involved in binding snRNA (reviewed in ).
As shown in Table 2, all Sm and LSM proteins have homologs in Arabidopsis, and eight of them are duplicated. It is likely that these genes existed as single copies in the ancestor of animals and plants, but duplicated within the plant lineage. Only one of the 24 genes (LSM5, At5g48870) has been characterized experimentally in Arabidopsis. The LSM5 gene was cloned from a mutant supersensitive to ABA (abscisic acid) and drought (SAD1 ). LSM5 is expressed at low levels in all tissues and its transcription is not altered by drought stress . cDNA and EST evidence exist for all other core protein genes, indicating that all 24 genes are active.
There are 63 Arabidopsis proteins corresponding to the 35 snRNP-specific proteins used as queries in our genome mapping. Very few of them have been characterized experimentally, including U1-70K, U1A and a tandem duplication pair of SAP130 [48–50]. U1-70K was reported as a single-copy essential gene. Expression of U1-70K antisense transcript under the APETALA3 promoter suppressed the development of sepals and petals . We identified an additional homolog of U1-70K (At2g43370) and named it U1-70K2. The U1-70K2 protein showed 48% similarity to the U1-70K protein according to Blast2 results. Both genes retain the sixth intron in some transcripts, a situation which would produce truncated proteins . Interestingly, we found that five of the 10 Arabidopsis U1 snRNP proteins, including the U1-70K-coding genes, may undergo alternative splicing.
Several genes in U2, U5, U4/U6 and U4.U6/U5 snRNPs, but none in U1 snRNP, occur in more than three copies in the Arabidopsis genome. The atSAP114 family has five members, including two that occur in tandem (atSAP114-1a and atSAP114-1b). Three members have EST/cDNA evidence (Table 2). Interestingly, the predicted atSAP114p (At4g15580) protein contains a RNase H domain at the amino-terminal end, and thus atSAP114p shares similarity to At5g06805, a gene annotated as encoding a non-LTR retroelement reverse transcriptase-like protein. It is likely that the atSAP114p gene is a pseudogene that originated by retroelement insertion. There are three copies of the gene for the tri-snRNP 65 kilodalton (kDa) subunit, which are clustered on chromosome 4. Both the U4/U6 90 kDa protein and the U4/U6 15.5 kDa protein also have three gene copies, and the 116 kDa and 200 kDa subunits in U5 snRNP have four copies apiece.
The yeast U1 snRNP contains several specific proteins that are not present in mammalian U1 snRNPs . As in mammals, Arabidopsis also lacks homologs of Prp42, a component of U1 snRNP in yeast . However, Arabidopsis has two copies of the gene for Prp39, which are similar to Prp42. Furthermore, atPrp39a can produce a shorter protein isoform with a novel amino-terminal sequence by exon skipping. It is possible that the duplicates and alternative isoforms of plant U1 snRNP proteins are functional homologs of the yeast-specific proteins.
Several proteins specific to the minor spliceosome are also conserved in Arabidopsis. The human 18S U11/U12 snRNP contains several proteins found in U2 snRNP as well as seven novel proteins . Four of the seven U11/U12-specific proteins (U11/U12-35K, 25K, 65K and 31K) are conserved in Arabidopsis, while the remaining three (59K, 48K and 20K) have no clear homologs. Interestingly, all four Arabidopsis genes are single copy in the genome, and three of them are apparently alternatively spliced (Table 2).
Splicing factors are slightly different in Arabidopsis than in other organisms
We divided the splicing factors into eight subgroups according to recent human spliceosome studies [13, 14, 16, 18]: splice-site selection proteins; SR proteins; 17S U2 associated proteins; 35S U5 associated proteins; proteins specific to the BΔU1 complex; exon junction complex (EJC) proteins; second-step splicing factors and other known splicing factors. We focused our analysis on the first two subgroups because their functions in splicing are well established. A total of 109 proteins in Arabidopsis were identified, corresponding to 67 human queries from all eight subgroups. Most of the proteins are conserved among eukaryotes, but some human proteins have no obvious homologs in the Arabidopsis genome, and some novel splicing factors appear to exist in Arabidopsis. About 43% of genes encoding splicing factors are duplicated in the genome, whereas some proteins, such as SF1/BBP (branchpoint-binding protein, which facilitates U2 snRNP binding in fission yeast ) and cap-binding proteins (CBP20 and CBP80, possibly involved in cap proximal intron splicing ), derive from single-copy genes . These single-copy gene products may work with all pre-mRNAs, including the ones with U12-type introns. Surprisingly, mutation of CBP80 (ABH1) is not lethal and is non-pleiotropic. The abh1 plants show ABA-hypersensitive closure of stomata and reduced wilting during drought .
Many splicing factors have been identified previously in Arabidopsis, including two U2AF65, two U2AF35, and 18 SR proteins [58–67]. The U2AF35-related protein atUrp, which could interact with U2AF65 and position RS-domain-containing splicing factors , is also present in the Arabidopsis genome. Although the Urp gene is expressed ubiquitously in human tissues, no ESTs from this gene were found in Arabidopsis. Three copies of PTB/hnRNP-I genes were identified in Arabidopsis. The PTB protein competes for the poly-pyrimidine tract with the U2AF large subunit, thus negatively regulating splicing .
We also identified a homolog related to atU2AF65 (At2g33440) and an additional SR protein (At2g46610). The U2AF65-related protein (atULrp, At2g33440) has 247 amino acids and shares over 40% similarity with the carboxy terminal region of the two atU2AF65 homologs. Only one RRM can be identified in atULrp, in contrast to three RRMs and one amino-terminal RS domain in atU2AF65 proteins, and there is no apparent RS domain in atULrp. No animal homolog of atULrp could be identified. The function of this one-RRM U2AF65-related protein is not clear. As it lacks other functional motifs, it might act as a competitor of U2AF65. A two-RRM U2AF65 protein can be produced through alternative splicing. The 11th intron of atU2AF65a can be retained (see RAFL full-length cDNA, gi:19310596) to produce a truncated protein with only the first two RRMs. Interestingly, the last RRM in atU2AF65a contains several amino-acid variations from the consensus pattern such that it could not be detected by InterPro and NCBI-CDD searches using default values, also suggesting that perhaps only the first two RRMs are essential.
The additional SR protein belongs to the atRSp31 family and was named atRSp32 (At2g46610). It shares 70% identity and 78% similarity with atRSp31. The protein is 250 amino acids in length and contains two RRMs and some RS dipeptides in the carboxy-terminal region. The gene structure of atRSp32 is similar to that of atRSp31. Two other genes (atRSp40 and atRSp41) are in the same family and also have similar exon and intron sizes (see gene structure information at ). Similarly to the previous classification of 18 SR proteins , the 19 SR proteins (including SR45) can be grouped into four large families of four to five members according to sequence similarity, gene structure and protein domain structure.
The atRSp31 family (atRSp31, atRSp32, atRSp40 and atRSp41) belongs to a novel plant SR family and has no clear animal ortholog. Other families include the SC35 (or SRrp/TASR2) family, SF2/ASF family, and the 9G8 family. Arabidopsis has a single copy of the SC35 ortholog and four SC35-like proteins (atSR33, atSCL30a, atSCL30 and atSCL28), which appear to have diverged significantly from SC35. It seems that this divergence predates the split of plants and animals because a similar SC35-like gene family exists in the human genome (SRrp35 and SRrp40). The SRrp35 and SRrp40 were found to antagonize other SR proteins in vitro and function in 5' splice-site selection . SF2/ASF has four copies (atSR1/SRp34, atSRp30, atSRp34a and atSRp34b) with similar gene structures and domains. Human 9G8 protein has five homologs in Arabidopsis, with three (atRSZp21, atRSZp22 and atRSZp22a) containing one CCHC-type zinc finger and two (atRSZ33, atRSZ32) containing two CCHC-type zinc fingers in addition to an RRM and an RS domain. Interestingly, several SR proteins (atRSZp21, atRSZp22, SR45 and SCL33) were found to interact with atU1-70K, and some SR proteins can interact with each other, thus forming a complicated interaction network to facilitate splice-site selection and spliceosome assembly [3, 61–63]. atSR45 was initially regarded as a novel plant SR protein , but by virtue of sequence-similarity scores it actually may be the ortholog of the human RNPS1 gene, which encodes an EJC protein. Other human SR proteins (SRp20, SRp30c, SRp40, SRp54, SRp55 and SRp75) lack clear orthologs in Arabidopsis. We conclude that SR protein families evolved differently in animals and plants from three to four common ancestors, including SC35, SF2/ASF and 9G8/RSZ. The SRrp (SC35-like in plants) family may not be classical SR proteins but they play important roles in splice-site selection.
Proteins in other subgroups, such as 17S U2 snRNP-associated proteins, 35S U5 snRNP-associated proteins, and proteins specific to the BΔU1 complex, are also conserved in Arabidopsis. The BΔU1 complex is the spliceosome complex captured immediately before catalytic activation. Most proteins in the 35S U5 snRNP are absent in the BΔU1 complex but present in the active B complex, indicating the important roles of 35S U5 snRNP-associated proteins in spliceosome activation . Conservation of these proteins in Arabidopsis revealed the same pathway of spliceosome activation in plants. A subcomplex named Prp19 complex in 35S U5 snRNP has a critical role in spliceosome activation [13, 72]. All proteins in the human Prp19 complex have homologs in Arabidopsis, including a chromosomal duplication pair of Prp19 genes and a single copy of the CDC5 gene. For the BΔU1 complex, six human genes have homologs, and five of them are single copy in Arabidopsis. Two genes (NPW38BP/SNP70 and p220(NPAT)) in the human BΔU1 complex have no apparent Arabidopsis homologs.
Arabidopsis also lacks an SMN protein complex. In human, the SMN protein (survival of motor neurons) can interact with a series of proteins including Gemin2, Gemin3 (a helicase), Gemin4, Gemin5 and Gemin6 to form an SMN complex, which has important roles in the biogenesis of snRNPs and the assembly of the spliceosome through direct interactions with Sm proteins and snRNA . Although the SMN protein exists in the fission yeast genome (GenBank accession CAA91173), no SMN complex members can be identified in the Arabidopsis genome.
Splicing regulators are expanded in Arabidopsis
Splicing regulators are proteins that can either modify splicing factors or compete with splicing factors for their binding site. Important splicing regulators are hnRNP proteins and SR protein kinases. The exact role of phosphorylation of SR proteins in splicing is not yet clear, but SR protein kinases are well conserved and exist as multiple copies in Arabidopsis. A total of eight SR protein kinases were identified in Arabidopsis, including three Lammer/CLK kinases (AFC1, AFC2 and AFC3), two SRPK1 homologs, and three SRPK2 homologs. The three Lammer/CLK kinases were identified previously, and AFC2 was shown to phosphorylate SR proteins in vitro [63, 74]. Overexpression of tobacco AFC2 homolog PK12 in Arabidopsis changed the alternative splice patterns of several genes, including atSRp30, atSR1/atSRp34 and U1-70K , indicating that these SR proteins may function to modulate splicing in plants.
The heterogeneous nuclear ribonucleoproteins (hnRNPs) bind to splice sites and to binding sites for splicing factors on nascent pre-mRNAs, thus competing with splicing factors to negatively control splicing (reviewed in ). Humans have about 20 hnRNP proteins, many of which function in splicing. A total of 35 potential hnRNP proteins possibly related to splicing were found in Arabidopsis by sequence-similarity searches, including a superfamily of glycine-rich RNA-binding proteins. This family contains 21 members similar to human hnRNP A1 and hnRNP A2/B1. It can be further divided into two subfamilies. One includes eight proteins containing one RRM, and another has 13 members with two RRMs. Twelve of these proteins were identified previously, including AtGRP7, AtGRP8, UBA2a, UBA2b, UBA2c and AtRNPA/B1-6 [11, 77, 78]. AtGRP7 was found to be able to influence alternative splicing of its own transcripts as well as AtGRP8 transcripts . UBA2 proteins can interact with UBP1 and UBA1 proteins, which have three RRMs and one RRM respectively, to recognize U-rich sequences in the 3' untranslated region (UTR) and stabilize mRNA . Although the overexpression of UBA2 did not stimulate splicing of a reporter gene in tobacco protoplasts , we cannot rule out the possibility that it could be involved in splicing of other genes.
Other human hnRNPs related to splicing also have homologs in Arabidopsis. BLAST searches of the human (CUG)n triplet repeat RNA-binding protein (CUG-BP) against all Arabidopsis proteins revealed three putative homologs, including atFCA. atFCA and CUG-BP share similarity within the RRMs and a region approximately 40 amino acids in length. An additional protein (At2g47310) related to FCA was identified and named FCA2, as it shares about 50% similarity with FCA. The FCA proteins have two RRMs and a WW domain, which interact with the FY protein, a homolog of yeast polyadenylation factor Psf2p [80, 81]. The FCA-FY complex negatively regulates the FCA protein by favoring a polyadenylation site from the third intron of FCA pre-mRNA [80, 82]. FCA may be a multifunctional protein involved in mRNA processing, as human CUG-BP can function in both alternative splicing and deadenylation . We also list 15 previously identified hnRNP-like proteins and two additional homologs as possible splicing regulators. The UBP1 proteins can strongly enhance splicing of some introns in protoplasts , whereas UBA1, RBP45 and RBP47 proteins have no similar function [78, 85].
Unclassified splicing protein candidates
In addition to the 260 proteins in the above three categories, there are also 84 Arabidopsis proteins corresponding to human spliceosome-associated proteins identified in recent proteomic studies [15–18]. Some of these proteins function in other processes, such as transcription, polyadenylation and even translation. Their association with spliceosomes provides evidence for the coupling of splicing and other processes. Other proteins have no known functions. Only 35.8% of the proteins in this category are duplicated in Arabidopsis. We also identified a total of 51 Arabidopsis protein-coding genes similar to known splicing proteins. They have conserved domains and some level of sequence similarity to known splicing factors. We did not include these two categories in Table 2, but detailed information about them is available at ASRG .
Distribution and duplication of Arabidopsis splicing-related genes
The distribution of Arabidopsis snRNA and splicing-related proteins across the genome is shown in Figure 2 and at the ASRG website. Overall, the genes appear evenly distributed on the chromosomes, with several small gene clusters. Only four snRNA genes are located on chromosome 2, three of which are U2 snRNA genes. No U4 snRNA gene is located on chromosome 4. For the protein-coding genes, most functional categories have members located on each chromosome. The only exception is the SR protein kinase family, which has no member on chromosome 1. Interestingly, chromosome 1 contains the most snRNP proteins and splicing factors, but has the fewest splicing regulators. Several gene clusters encoding splicing-related proteins were also identified. Some clusters, such as tandemly duplicated gene pairs, include genes from the same category. One cluster located on chromosome 4 includes four genes encoding tri-snRNP proteins (atTri65a, atTri65b, atTri65c and atTri15.5c, homologs of tri-snRNP 65-KD protein and 15.5 KD protein). Two other clusters, atU2A-atCdc5 and atCUG-BP1-atU1C, include genes from different functional categories. No clear clusters of genes for snRNA-splicing-related proteins were identified. Although about one third of snRNA genes are located near other protein-coding genes, none of their neighboring genes is related to splicing. As a caveat, we should point out that our snRNA gene determination strongly suggests annotation errors in overlapping protein-coding gene models. Thus, atU2-1, atU2.3, atU4.2, atU4-11p, atU5-13 and atU6.26 overlap gene models At1g16820, At3g57770, At3g06895, At1g68390, At5g53740 and At3g13857, respectively, but none of these models is well supported by cDNA or EST evidence (see displays linked at ASRG ).
The 260 proteins in the first three categories could be grouped into 130 families, 66 of which consist of multiple members. The duplication rate is over 50%, which is higher than the 44% duplication rate of Arabidopsis transcription factors . As shown in Table 3, about 50% of genes encoding snRNP proteins, 43% of splicing factors, and 78% of splicing regulators have duplications. The much higher duplication rate of splicing regulators may reflect diversification in splicing control.
At least 130 duplication events are required to yield the 260 proteins from 130 families, given one single-copy ancestor per family. Thirty-three duplication events (about a quarter of the total) are likely to be the result of chromosome duplications. The chromosomal duplication ratio is 18.9-27.5% among the three groups (see Table 3). Some snRNA gene pairs, such as U2-14/U2-10, U5-3/U5-5 and (U6.1 U6.26)/(U6-8p U6-9p), may also have been produced by chromosome duplication. The C.D.2-3 region (the chromosome duplication region between chromosomes 2 and 3) has the most splicing-related gene pairs. Six genes in this region on chromosome 2 were duplicated in the same order on chromosome 3. EST evidence shows that all these genes are expressed. Three U5 snRNA genes (U5.1, U5.1b and U5-4) and four U2 snRNA genes (U2.2, U2.3, U2.4 and U2.6) are also located in the same region on chromosome 3. No U5 or U2 homologs exist in the corresponding region on chromosome 2, suggesting either that the snRNA duplication events in that region happened after the chromosome duplication event, or that the snRNA duplicates were lost subsequent to chromosome duplication.
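The counts in the two preceding paragraphs follow from simple arithmetic, which can be checked with a short sketch. Only the numbers come from the text; the variable names are illustrative, and reading the "over 50%" duplication rate as the share of multi-member families is our assumption.

```python
# Minimum duplication events implied by the family counts quoted in the text.
# Variable names are illustrative; only the numbers come from the text.
n_proteins = 260        # proteins in the first three categories
n_families = 130        # families they were grouped into
n_multi_member = 66     # families with more than one member
n_chromosomal = 33      # events attributed to chromosome duplication

# A family of k members needs at least k - 1 duplications from one ancestor,
# so summing over all families gives n_proteins - n_families.
min_events = n_proteins - n_families

# Assumption: the "over 50%" rate is the share of multi-member families.
family_dup_rate = n_multi_member / n_families
chromosomal_share = n_chromosomal / min_events

print(min_events)                    # 130
print(f"{family_dup_rate:.1%}")      # 50.8%
print(f"{chromosomal_share:.1%}")    # 25.4%
```

The last figure corresponds to the "about a quarter of the total" stated above.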
Chromosomal duplication rather than individual gene duplication appears to be the predominant mode of amplification for some types of genes. As shown in Table 2, the 24 genes encoding core proteins have nine duplication pairs, five of which can be attributed to chromosomal duplications. The 19 SR protein genes include eight duplication pairs, six of which are probably the results of chromosomal duplications. At least five chromosomal duplication events contributed to the superfamily of 21 hnRNP glycine-rich RBD and A/B genes. It is not clear why these functional categories have high chromosomal duplication ratios. It is possible that chromosomal duplication could create positive selection to maintain similar copy numbers of other genes encoding proteins that interact with the products of already duplicated genes.
Alternative splicing of Arabidopsis splicing-related genes
According to EST/cDNA alignments, 80 of the 260 protein coding genes show 66 alternative splicing events. This rate (30.8%) is much higher than the overall frequency of alternative splicing in Arabidopsis, which is about 13% using the same criteria (2,747 genes out of 20,446 genes with EST/cDNA evidence; B.-B.W. and V.B., unpublished work). As shown in Table 4, the snRNP protein-coding genes have the lowest alternative splicing ratio (24.2%), whereas the ratios for splicing factor and splicing regulator genes are both over 33%. More than half of the genes encoding EJC proteins, proteins specific for the BΔU1 complex, SR proteins, U11/U12 snRNP-specific proteins and U1 snRNP proteins undergo alternative splicing.
Among the different types of alternative splicing, intron retention is the most abundant in the alternative transcripts identified for the 260 classified splicing-related genes. As shown in Table 4, 44 of the 80 alternatively spliced genes (about 55%) involve intron retention, 28 (35%) involve alternative acceptor-site selection and 15 (18.7%) involve exon skipping. Compared with the corresponding ratios for all Arabidopsis alternative splicing events (55.3% intron retention, 23.4% alternative acceptor-site selection and 6.3% exon skipping; B.-B.W. and V.B., unpublished work), the ratio of intron retention in splicing-related genes is similar and the ratio of exon skipping is higher. Interestingly, only one of the 20 alternatively spliced splicing regulator genes (about 5%) shows exon skipping, suggesting that exon skipping is an important post-transcriptional method for controlling the expression of genes encoding splicing factors but not of genes encoding splicing regulators.
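The frequencies quoted in the two preceding paragraphs can be recomputed from the gene counts. The sketch below uses only numbers given in the text; the variable names and event-type labels are illustrative. Because the three per-type counts sum to 87, more than the 80 genes, some genes presumably show more than one event type, so the shares are not expected to sum to 100%.

```python
# Alternative splicing frequencies recomputed from the counts in the text.
as_genes, classified_genes = 80, 260            # splicing-related genes
genome_as, genome_with_evidence = 2747, 20446   # all genes with EST/cDNA evidence

print(f"{as_genes / classified_genes:.1%}")         # 30.8%
print(f"{genome_as / genome_with_evidence:.1%}")    # 13.4%

# Genes per event type among the 80 alternatively spliced genes.
# Labels are illustrative shorthand for the event types named in the text.
event_genes = {"IntronR": 44, "AltA": 28, "ExonS": 15}
shares = {name: count / as_genes for name, count in event_genes.items()}
```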
Previous studies had determined 30 snRNA genes and 46 protein-coding genes related to splicing in Arabidopsis (see Tables 1 and 2). In this study, we have computationally identified an additional 44 snRNA genes (Table 1) and 349 protein-coding genes (Table 2) that also may be involved in splicing. Among the five types of U snRNAs, U6 is the most conserved and U1 is the least conserved. We identified seven U1-U4 snRNA gene clusters. We were surprised to see so many U1-U4 clusters in Arabidopsis. In Drosophila, four snRNA clusters were reported , but none of them includes U1-U4 gene pairs. It is likely that a U1-U4 snRNA cluster existed in a progenitor of the current Arabidopsis genome, which was duplicated several times to form the extant seven clusters. The non-clustered U1 and U4 snRNA genes may have arisen by individual gene duplication or gene loss in duplicated clusters.
Among the proteins involved in splicing, most animal homologs are conserved in plants, indicating an ancient, monophyletic origin for the splicing mechanism. A striking feature of plant splicing-related genes is their duplication ratio. Fifty percent of the splicing genes are duplicated in Arabidopsis. The duplication ratio of the splicing-related genes increases from genes encoding snRNP proteins to genes encoding splicing regulators. These data strongly suggest that the general splicing mechanism is conserved, but that the control of splicing may be more diverse in plants.
The high duplication ratio of Arabidopsis splicing-related genes could be the result of evolutionary selection. Unlike animals, which can move around to maintain more homogeneous physiological conditions, plants are exposed to a larger range of stress conditions such as heat and cold. The duplicates will more probably be maintained in the genome as their functions become diversified, and potentially plant-specific, to ensure the fidelity of splicing under such varied conditions. Chromosome duplication has produced several Sm proteins, SR proteins and hnRNP proteins in Arabidopsis, which in turn could create positive selective pressures influencing the rate of duplication for functionally related genes. Because chromosome duplication occurred differentially within each plant lineage, we would expect different duplication patterns of these genes in, for example, monocots and dicots.
To confirm the above hypothesis, we searched the recently sequenced rice genome using the five Arabidopsis SC35 and SC35-like proteins as probes. Eight distinct genome loci were found to encode SC35 and SC35-like proteins, including three homologs of atSC35, two homologs of atSR33/SCL33 and atSCL30a, two homologs of atSCL30, and one homolog of atSCL28. Five of the eight rice genes are currently annotated in GenBank with accession numbers BAC79909 (osSC35a), BAD09319 (osSC35b), AAP46199 (osSR33-1), BAC799901 (osSCL30a/osSR33-2), and BAD19168 (osSCL30-1). As shown in the phylogenetic tree displayed in Figure 3, the two rice SC35 genes and atSC35 are likely to be orthologs of the animal SC35 gene. The other sequences cluster in SC35-like (SRrp/TASR) clades, indicating that the SC35 and SRrp/TASR genes diverged before the divergence of monocot and dicot plants (the divergence presumably happened even before the divergence of animals and plants, as described earlier). In addition, there are species-specific duplications. Thus, the Arabidopsis chromosomal duplication pair atSR33 - atSCL30a forms a clade, while their rice copies (osSR33-1 and osSCL30a) form another clade. Also there are additional duplications for the rice SC35 and SCL30 genes. We are currently working to identify all rice splicing related genes. The complete sets of these genes in two plant species should provide a good foundation for assessing similarities and differences in splicing mechanisms used by monocot and dicot plants.
As introns evolve rapidly, the mechanism to recognize and splice them should either evolve correspondingly or be flexible enough to accommodate the changes. It seems that plants deploy the most economic and practical way by keeping a largely conserved splicing mechanism and a very flexible recognition and control mechanism. Direct evidence comes from the presence of plant-specific splicing proteins, such as the novel SR protein family and the superfamily of hnRNP A/B. The absence of the SMN complex and some yeast U1 snRNP proteins in Arabidopsis indicates that other organisms also have integrated new proteins or pathways into the splicing mechanism over the course of evolution relative to other eukaryotes. Other evidence supporting a conserved splicing mechanism with flexible regulation includes differential conservation among U snRNAs (U1 snRNAs are less conserved than U6 snRNAs) and high alternative splicing frequency in U1 snRNP proteins, SR proteins and hnRNP proteins. The SR proteins and U1 snRNP proteins are involved in early steps of splicing and in 5' and 3' splice-site selection; multiple isoforms of these proteins may be functionally significant in the control of splicing.
It is interesting to note that the overall alternative splicing frequency in splicing-related genes is much higher than the frequency averaged over all Arabidopsis genes. More than half of SR proteins and U1 snRNP proteins show alternative splicing. Alternative splicing might increase protein diversity derived from splicing-related genes, which would further add flexibility to the splicing mechanism. The high frequency of alternative transcripts from splicing-related genes raises another interesting question: how is splicing regulated in these splicing-related genes? One possible answer is that some splicing-related genes may be autoregulated. Accumulation of one transcript would feed back to inhibit or promote other isoforms. Several splicing-related genes have been reported to be regulated in this way. For example, AtGRP7 (hnRNP A/B superfamily) is a circadian clock-regulated protein that negatively autoregulates its expression. When the AtGRP7 protein accumulates over the circadian cycle, it promotes production of alternative transcripts that use a cryptic 5' splice site. The alternative transcripts contain premature stop codons and, as a result of message instability, do not accumulate to high levels, thus decreasing the level of AtGRP7 protein. atSRp30 has similar effects on its own transcripts. Another possible answer is that some splicing-related genes might regulate the splicing of other splicing-related genes. For example, overexpression of AtGRP7 and atSRp30 is known to affect the splicing of AtGRP8 and atSR1, respectively [65, 79]. A third possibility is that the environment could affect the alternative splicing pattern. A good example is the SR1 gene. The ratio of two transcripts from the SR1 gene (SR1B/SR1) increases in a temperature-dependent manner. Generally, heat or cold stress could cause intron retention in some splicing regulators, which could further alter the splicing pattern of other genes.
A fourth group of possible regulators is the intronless genes. Combining all these possibilities, a pathway to regulate splicing could be inferred as follows: environmental changes → splicing pattern changes in some specific splicing-related genes and/or intronless genes → expression pattern changes (including splicing pattern changes) in general splicing-related genes → changes in splicing patterns for specific genes.
A large number of Arabidopsis splicing-related genes were computationally identified in this study by means of sequence comparisons and motif searches, including a tentative U4atac snRNA gene containing all conserved motifs, a new SR protein-coding gene (atRSp32) belonging to the atRSp31 family, and several genes related to genes encoding known splicing-related proteins (atULrp and atFCA2). A web-accessible database containing all the Arabidopsis splicing related genes has been constructed and will be expanded to other organisms in the near future. This compilation should provide a good foundation to study the splicing process in more detail and to determine to what extent these genes are conserved across the entire plant kingdom. Our data show that about 50% of the splicing-related genes are duplicated in Arabidopsis. The duplication ratios for splicing regulators are even higher, indicating that the splicing mechanism is generally conserved among plants, but that the regulation of splicing may be more variable and flexible, thus enabling plants to respond to their specific environments.
Materials and methods
Search for Arabidopsis snRNAs
Sequences of the 15 experimentally identified major snRNAs were downloaded from GenBank. The two minor snRNA sequences were compiled from the literature. These genes were used to search against the Arabidopsis genome at the AtGDB BLAST server and at the SALK T-DNA Express web server. Our initial analysis was based on Release 3.0 of the Arabidopsis genome (GenBank accession numbers NC_003070.4, NC_003071.3, NC_003074.4, NC_003075.3, and NC_003076.4). Local BLAST was used to derive the locations of the snRNA homologs from more recently sequenced regions of the genome. Criteria used for local BLAST were '-e 1 -F F -W 7' (e-value cutoff of 1, dust filter off, minimum word size of 7). Human and maize snRNAs were also included as query sequences, and all hits with e-values less than 10^-5 were regarded as possible homologs. A total of 70 major snRNAs and three minor snRNAs were identified by this method. Each major snRNA type has 10-18 copies in the genome. A tentative gene name and gene model were assigned to each snRNA gene after comparison with the snRNAs identified in MATDB. Sequence-similarity values were based on BLAST alignments.
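The e-value filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the BLAST hits were saved in the standard 12-column tab-separated format (an assumption, as the paper does not state the output format), and the function name and example rows are made up.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold quoted in the text


def filter_hits(handle, cutoff=EVALUE_CUTOFF):
    """Yield (query, subject, start, end, evalue) for hits below the cutoff.

    Assumes 12-column tab-separated BLAST output: query, subject, % identity,
    alignment length, mismatches, gap opens, q.start, q.end, s.start, s.end,
    e-value, bit score.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue < cutoff:
            s_start, s_end = int(row[8]), int(row[9])
            # Minus-strand hits report s.start > s.end; normalise the order.
            yield row[0], row[1], min(s_start, s_end), max(s_start, s_end), evalue


# Made-up example rows, for illustration only.
example = io.StringIO(
    "U1_query\tchr5\t98.2\t164\t3\t0\t1\t164\t2000\t2163\t1e-80\t300\n"
    "U1_query\tchr1\t85.0\t40\t6\t0\t10\t49\t9050\t9011\t0.5\t30.1\n"
)
hits = list(filter_hits(example))
print(hits)
```

Only the first example row passes the cutoff; the second is reported by a permissive search (e-value cutoff of 1) but is discarded as a possible homolog.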
Search for Arabidopsis splicing-related proteins
A three-round BLAST search strategy was used to identify Arabidopsis splicing-related protein-coding genes. First, sequences of splicing-related proteins from human and Drosophila were downloaded from GenBank according to several recent proteomic studies [15–18] and the website compilation of Stephen Mount's group available at . Human hnRNP proteins identified in a recent review were downloaded from GenBank. All these sequences were used as queries in a local BLAST search against Arabidopsis annotated proteins (obtained from TIGR at ). All hits with an e-value less than 10^-10 were collected as candidates. Many of these candidates had highly significant e-values (usually 10^-30 or below and much lower than other hits). These candidates were regarded as true homologs.
In the second step, all identified true homologs were used to query the Arabidopsis protein set again. An e-value of 10^-20 was used as a cutoff value to find possible paralogs of the true homologs. Sequences identified in both rounds of BLAST hits were regarded as main candidates for splicing-related proteins.
Finally, the main candidates were queried against GenPept and all annotated human proteins (obtained from Ensembl). All candidates with significant similarity to proteins unrelated to splicing were removed from the main candidate list, and all candidates with significant similarity to proteins related to splicing were regarded as true splicing-related genes and were promoted to the status of true homologs. The remaining candidates were regarded as unclassified splicing-related proteins. BLAST results were initially analyzed by MuSeqBox. Two custom scripts were written to read MuSeqBox output files, largely automating the search procedure.
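The decision logic of the three rounds can be summarised in a short sketch. It is not the authors' code: the function and argument names are hypothetical, the inputs stand in for precomputed summaries of BLAST output, and treating "both rounds" as the union of the two hit sets is our interpretation of the text.

```python
def three_round_screen(round1, round2, annotation, cutoff1=1e-10, cutoff2=1e-20):
    """Classify Arabidopsis proteins following the three rounds described above.

    round1: {arabidopsis_id: best e-value against known splicing proteins}
    round2: {arabidopsis_id: best e-value against round-1 homologs}
    annotation: {arabidopsis_id: "splicing" | "other" | None}, a summary of the
        best third-round match (None means no informative match).
    All inputs are hypothetical, precomputed summaries of BLAST output.
    """
    # Rounds 1 and 2: collect main candidates (interpreted here as a union).
    candidates = {gene for gene, e in round1.items() if e < cutoff1}
    candidates |= {gene for gene, e in round2.items() if e < cutoff2}

    # Round 3: remove, promote, or leave unclassified.
    result = {"true": set(), "unclassified": set()}
    for gene in candidates:
        label = annotation.get(gene)
        if label == "other":
            continue  # resembles a protein unrelated to splicing: removed
        if label == "splicing":
            result["true"].add(gene)
        else:
            result["unclassified"].add(gene)
    return result


# Made-up identifiers and e-values, for illustration only.
round1 = {"geneA": 1e-40, "geneB": 1e-12, "geneC": 1e-3}
round2 = {"geneD": 1e-25, "geneC": 1e-5}
annotation = {"geneA": "splicing", "geneB": "other", "geneD": None}
screen = three_round_screen(round1, round2, annotation)
print(screen)
```

In this toy input, geneC fails both cutoffs, geneB is removed in the third round, geneA is promoted to a true homolog and geneD remains unclassified.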
Gene structure and chromosomal locations
The gene structure and chromosomal locations for the genes encoding splicing-related proteins were retrieved from AtGDB. The chromosomal locations of the snRNA genes were inferred from the BLAST results. The location maps (Figure 1) were generated using the AtGDB advanced search function. Spliced alignments of ESTs and cDNAs generated by GeneSeqer were used to verify gene models. Gene structure information was used as an important criterion to group homologs into gene families.
InterProScan 3.3 was downloaded from  and was subsequently used to search protein domain databases using default parameters . A Perl script was written to process the text results from InterProScan. Protein domain information was used in comparisons of homologs from different species. The search of the National Center for Biotechnology Information Conserved Domain Database (NCBI-CDD)  was conducted manually for certain genes to confirm the InterPro results.
The gene families with multiple copies were inspected to determine whether they were likely to have derived from chromosome-duplication events. Gene models of the duplicated gene were searched against the gene list of each chromosome redundancy region at MATDB . If the gene and its duplicate were both in the list, they were regarded as a chromosome duplication pair. Otherwise, they were assumed to be produced by random gene duplication.
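The membership test described above amounts to checking whether both members of a pair occur in the gene list of the same duplicated region. A minimal sketch, with a hypothetical function name and made-up gene identifiers standing in for the per-region lists:

```python
def classify_duplicate_pair(gene_a, gene_b, duplicated_regions):
    """Return 'chromosomal' if both genes appear in one duplicated-region list.

    duplicated_regions: {region_name: set of gene identifiers}, a hypothetical
    stand-in for the per-region gene lists consulted in the text.
    """
    for genes in duplicated_regions.values():
        if gene_a in genes and gene_b in genes:
            return "chromosomal"
    return "individual"


# Made-up region contents, for illustration only.
regions = {
    "C.D.2-3": {"gene2a", "gene3a", "gene2b", "gene3b"},
    "C.D.1-5": {"gene1x", "gene5x"},
}
print(classify_duplicate_pair("gene2a", "gene3a", regions))  # chromosomal
print(classify_duplicate_pair("gene2a", "gene5x", regions))  # individual
```

Pairs that fall through to "individual" correspond to those the text assumes arose by random gene duplication.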
Identification of alternative splicing
All Arabidopsis ESTs and cDNAs were aligned against the genome using the spliced alignment program GeneSeqer as made available through AtGDB. We retrieved the intron and exon coordinates of the reliable cognate alignments from the database. Scripts were written to identify introns that overlap with other introns or exons. We defined the alternative splicing cases as follows: alternative donor (AltD): an intron has the same 3'-end coordinate as, but a different 5'-end coordinate from, another overlapping intron; alternative acceptor (AltA): an intron has the same 5'-end coordinate as, but a different 3'-end coordinate from, another intron; alternative position (AltP): an intron differs in both its 5'-end and 3'-end coordinates from another overlapping intron; exon skipping (ExonS): an annotated intron completely contains an alternatively identified exon in the same transcription direction; intron retention (IntronR): an annotated intron is completely contained by an alternatively identified exon.
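The five definitions above translate directly into coordinate comparisons. The sketch below is an illustration rather than the authors' scripts: the function names are hypothetical, features are assumed to be on the plus strand and in the same transcription direction (so the 5' end is the smaller coordinate), and "completely contains" is read as strict containment.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_intron_pair(intron, other):
    """Compare two introns given as (five_prime, three_prime) coordinates."""
    if intron == other or not overlaps(intron, other):
        return None
    same_5 = intron[0] == other[0]
    same_3 = intron[1] == other[1]
    if same_3 and not same_5:
        return "AltD"   # shared 3' end, different 5' end
    if same_5 and not same_3:
        return "AltA"   # shared 5' end, different 3' end
    return "AltP"       # overlapping, both ends differ


def classify_intron_exon(intron, exon):
    """Compare an annotated intron with an alternatively identified exon."""
    if intron[0] < exon[0] and exon[1] < intron[1]:
        return "ExonS"    # exon lies entirely inside the intron
    if exon[0] < intron[0] and intron[1] < exon[1]:
        return "IntronR"  # intron lies entirely inside the exon
    return None


# Made-up coordinates, for illustration only.
print(classify_intron_pair((100, 200), (120, 200)))  # AltD
print(classify_intron_pair((100, 200), (100, 180)))  # AltA
print(classify_intron_pair((100, 200), (150, 250)))  # AltP
print(classify_intron_exon((100, 300), (150, 200)))  # ExonS
print(classify_intron_exon((100, 200), (50, 250)))   # IntronR
```

For minus-strand genes the 5' end is the larger coordinate, so coordinates would need to be swapped or negated before applying these comparisons.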
Database and interface construction
Details about each splicing-related gene were saved in a MySQL database. PHP scripts were written to interact with the database and generate the interface web pages. Text and BLAST searches were implemented by Perl-cgi scripts.
References
Kazan K: Alternative splicing and proteome diversity in plants: the tip of the iceberg has just emerged. Trends Plant Sci. 2003, 8: 468-471. 10.1016/j.tplants.2003.09.001.
Lorkovic ZJ, Wieczorek Kirk DA, Lambermon MH, Filipowicz W: Pre-mRNA splicing in higher plants. Trends Plant Sci. 2000, 5: 160-167. 10.1016/S1360-1385(00)01595-8.
Reddy ASN: Nuclear pre-mRNA splicing in plants. Critical Rev Plant Sci. 2001, 20: 523-571. 10.1016/S0735-2689(01)80004-6.
Mount SM, Salz HK: Pre-messenger RNA processing factors in the Drosophila genome. J Cell Biol. 2000, 150: F37-F44. 10.1083/jcb.150.2.F37.
Käufer NF, Potashkin J: Analysis of the splicing machinery in fission yeast: a comparison with budding yeast and mammals. Nucleic Acids Res. 2000, 28: 3003-3010. 10.1093/nar/28.16.3003.
Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
Barakat A, Szick-Miranda K, Chang IF, Guyot R, Blanc G, Cooke R, Delseny M, Bailey-Serres J: The organization of cytoplasmic ribosomal protein genes in the Arabidopsis genome. Plant Physiol. 2001, 127: 398-415. 10.1104/pp.127.2.398.
Beisson F, Koo AJ, Ruuska S, Schwender J, Pollard M, Thelen JJ, Paddock T, Salas JJ, Savage L, Milcamps A, et al: Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol. 2003, 132: 681-697. 10.1104/pp.103.022988.
Wang D, Harper JF, Gribskov M: Systematic trans-genomic comparison of protein kinases between Arabidopsis and Saccharomyces cerevisiae. Plant Physiol. 2003, 132: 2152-2165. 10.1104/pp.103.021485.
Aubourg S, Kreis M, Lecharny A: The DEAD box RNA helicase family in Arabidopsis thaliana. Nucleic Acids Res. 1999, 27: 628-636. 10.1093/nar/27.2.628.
Lorkovic ZJ, Barta A: Genome analysis: RNA recognition motif (RRM) and K homology (KH) domain RNA-binding proteins from the flowering plant Arabidopsis thaliana. Nucleic Acids Res. 2002, 30: 623-635. 10.1093/nar/30.3.623.
TAIR: gene family information. [http://www.arabidopsis.org/info/genefamily/genefamily.html]
Makarova OV, Makarov EM, Urlaub H, Will CL, Gentzel M, Wilm M, Lührmann R: A subset of human 35S U5 proteins, including Prp19, function prior to catalytic step 1 of splicing. EMBO J. 2004, 23: 2381-2391. 10.1038/sj.emboj.7600241.
Will CL, Schneider C, Hossbach M, Urlaub H, Rauhut R, Elbashir S, Tuschl T, Lührmann R: The human 18S U11/U12 snRNP contains a set of novel proteins not found in the U2-dependent spliceosome. RNA. 2004, 10: 929-941. 10.1261/rna.7320604.
Zhou Z, Sim J, Griffith J, Reed R: Purification and electron microscopic visualization of functional human spliceosomes. Proc Natl Acad Sci USA. 2002, 99: 12203-12207. 10.1073/pnas.182427099.
Makarov EM, Makarova OV, Urlaub H, Gentzel M, Will CL, Wilm M, Lührmann R: Small nuclear ribonucleoprotein remodeling during catalytic activation of the spliceosome. Science. 2002, 298: 2205-2208. 10.1126/science.1077783.
Rappsilber J, Ryder U, Lamond AI, Mann M: Large-scale proteomic analysis of the human spliceosome. Genome Res. 2002, 12: 1231-1245. 10.1101/gr.473902.
Jurica MS, Moore MJ: Pre-mRNA splicing: awash in a sea of proteins. Mol Cell. 2003, 12: 5-14. 10.1016/S1097-2765(03)00270-3.
Arabidopsis Splicing Related Genes Database. [http://www.plantgdb.org/prj/SiP/SRGD/ASRG]
Arabidopsis thaliana Genome Database. [http://www.plantgdb.org/AtGDB]
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
Vankan P, Filipowicz W: Structure of U2 snRNA genes of Arabidopsis thaliana and their expression in electroporated plant protoplasts. EMBO J. 1988, 7: 791-799.
Vankan P, Edoh D, Filipowicz W: Structure and expression of the U5 snRNA gene of Arabidopsis thaliana. Conserved upstream sequence elements in plant U-RNA genes. Nucleic Acids Res. 1988, 16: 10425-10440.
Vankan P, Filipowicz W: A U-snRNA gene-specific upstream element and a -30 'TATA box' are required for transcription of the U2 snRNA gene of Arabidopsis thaliana. EMBO J. 1989, 8: 3875-3882.
Waibel F, Filipowicz W: U6 snRNA genes of Arabidopsis are transcribed by RNA polymerase III but contain the same two upstream promoter elements as RNA polymerase II-transcribed U-snRNA genes. Nucleic Acids Res. 1990, 18: 3451-3458.
Hofmann CJ, Marshallsay C, Waibel F, Filipowicz W: Characterization of the genes encoding U4 small nuclear RNAs in Arabidopsis thaliana. Mol Biol Rep. 1992, 17: 21-28.
Shukla GC, Padgett RA: Conservation of functional features of U6atac and U12 snRNAs between vertebrates and higher plants. RNA. 1999, 5: 525-538. 10.1017/S1355838299982213.
Marker C, Zemann A, Terhorst T, Kiefmann M, Kastenmayer JP, Green P, Bachellerie JP, Brosius J, Huttenhofer A: Experimental RNomics: identification of 140 candidates for small non-messenger RNAs in the plant Arabidopsis thaliana. Curr Biol. 2002, 12: 2002-2013. 10.1016/S0960-9822(02)01304-0.
ASRG snRNAs. [http://www.plantgdb.org/prj/SiP/SRGD/ASRG/AtsnRNA.php]
Patel AA, Steitz JA: Splicing double: insights from the second spliceosome. Nat Rev Mol Cell Biol. 2003, 4: 960-970. 10.1038/nrm1259.
Connelly S, Filipowicz W: Activity of chimeric U small nuclear RNA (snRNA)/mRNA genes in transfected protoplasts of Nicotiana plumbaginifolia: U snRNA 3'-end formation and transcription initiation can occur independently in plants. Mol Cell Biol. 1993, 13: 6403-6415.
Connelly S, Marshallsay C, Leader D, Brown JW, Filipowicz W: Small nuclear RNA genes transcribed by either RNA polymerase II or RNA polymerase III in monocot plants share three promoter elements and use a strategy to regulate gene expression different from that used by their dicot plant counterparts. Mol Cell Biol. 1994, 14: 5910-5919.
Tarn WY, Steitz JA: Pre-mRNA splicing: the discovery of a new spliceosome doubles the challenge. Trends Biochem Sci. 1997, 22: 132-137. 10.1016/S0968-0004(97)01018-9.
Shukla GC, Padgett RA: U4 small nuclear RNA can function in both the major and minor spliceosomes. Proc Natl Acad Sci USA. 2004, 101: 93-98. 10.1073/pnas.0304919101.
Shukla GC, Cole AJ, Dietrich RC, Padgett RA: Domains of human U4atac snRNA required for U12-dependent splicing in vivo. Nucleic Acids Res. 2002, 30: 4650-4657. 10.1093/nar/gkf609.
Krämer A: The structure and function of proteins involved in mammalian pre-mRNA splicing. Annu Rev Biochem. 1996, 65: 367-409. 10.1146/annurev.bi.65.070196.002055.
Will CL, Lührmann R: Protein functions in pre-mRNA splicing. Curr Opin Cell Biol. 1997, 9: 320-328. 10.1016/S0955-0674(97)80003-8.
ASRG proteins. [http://www.plantgdb.org/prj/SiP/SRGD/ASRG/ASRP-home.php]
Stevens SW, Abelson J: Purification of the yeast U4/U6.U5 small nuclear ribonucleoprotein particle and identification of its proteins. Proc Natl Acad Sci USA. 1999, 96: 7226-7231. 10.1073/pnas.96.13.7226.
Stevens SW, Barta I, Ge HY, Moore RE, Young MK, Lee TD, Abelson J: Biochemical and genetic analyses of the U5, U6, and U4/U6 × U5 small nuclear ribonucleoproteins from Saccharomyces cerevisiae. RNA. 2001, 7: 1543-1553.
Gottschalk A, Neubauer G, Banroques J, Mann M, Lührmann R, Fabrizio P: Identification by mass spectrometry and functional analysis of novel proteins of the yeast [U4/U6.U5] tri-snRNP. EMBO J. 1999, 18: 4535-4548. 10.1093/emboj/18.16.4535.
Caspary F, Shevchenko A, Wilm M, Seraphin B: Partial purification of the yeast U2 snRNP reveals a novel yeast pre-mRNA splicing factor required for pre-spliceosome assembly. EMBO J. 1999, 18: 3463-3474. 10.1093/emboj/18.12.3463.
Krämer A, Grüter P, Gröning K, Kastner B: Combined biochemical and electron microscopic analyses reveal the architecture of the mammalian U2 snRNP. J Cell Biol. 1999, 145: 1355-1368. 10.1083/jcb.145.7.1355.
Fabrizio P, Esser S, Kastner B, Lührmann R: Isolation of S. cerevisiae snRNPs: comparison of U1 and U4/U6.U5 to their human counterparts. Science. 1994, 264: 261-265.
Will CL, Lührmann R: Spliceosomal UsnRNP biogenesis, structure and function. Curr Opin Cell Biol. 2001, 13: 290-301. 10.1016/S0955-0674(00)00211-8.
Xiong L, Gong Z, Rock CD, Subramanian S, Guo Y, Xu W, Galbraith D, Zhu JK: Modulation of abscisic acid signal transduction and biosynthesis by an Sm-like protein in Arabidopsis. Dev Cell. 2001, 1: 771-781. 10.1016/S1534-5807(01)00087-9.
Golovkin M, Reddy AS: Structure and expression of a plant U1 snRNP 70K gene: alternative splicing of U1 snRNP 70K pre-mRNAs produces two different transcripts. Plant Cell. 1996, 8: 1421-1435. 10.1105/tpc.8.8.1421.
Simpson GG, Clark GP, Rothnie HM, Boelens W, van Venrooij W, Brown JW: Molecular characterization of the spliceosomal proteins U1A and U2B' from higher plants. EMBO J. 1995, 14: 4540-4550.
Casacuberta E, Puigdomenech P, Monofort A: A genomic duplication in Arabidopsis thaliana contains a sequence similar to the human gene coding for SAP130. Plant Physiol Biochem. 2001, 39: 565-573. 10.1016/S0981-9428(01)01280-3.
Golovkin M, Reddy AS: Expression of U1 small nuclear ribonucleoprotein 70K antisense transcript using APETALA3 promoter suppresses the development of sepals and petals. Plant Physiol. 2003, 132: 1884-1891. 10.1104/pp.103.023192.
Gottschalk A, Tang J, Puig O, Salgado J, Neubauer G, Colot HV, Mann M, Seraphin B, Rosbash M, Lührmann R, Fabrizio P: A comprehensive biochemical and genetic analysis of the yeast U1 snRNP reveals five novel proteins. RNA. 1998, 4: 374-393.
McLean MR, Rymond BC: Yeast pre-mRNA splicing requires a pair of U1 snRNP-associated tetratricopeptide repeat proteins. Mol Cell Biol. 1998, 18: 353-360.
Huang T, Vilardell J, Query CC: Pre-spliceosome formation in S. pombe requires a stable complex of SF1-U2AF(59)-U2AF(23). EMBO J. 2002, 21: 5516-5526. 10.1093/emboj/cdf555.
Lewis JD, Gorlich D, Mattaj IW: A yeast cap binding protein complex (yCBC) acts at an early step in pre-mRNA splicing. Nucleic Acids Res. 1996, 24: 3332-3336. 10.1093/nar/24.17.3332.
Kmieciak M, Simpson CG, Lewandowska D, Brown JW, Jarmolowski A: Cloning and characterization of two subunits of Arabidopsis thaliana nuclear cap-binding complex. Gene. 2002, 283: 171-183.
Hugouvieux V, Kwak JM, Schroeder JI: An mRNA cap binding protein, ABH1, modulates early abscisic acid signal transduction in Arabidopsis. Cell. 2001, 106: 477-487. 10.1016/S0092-8674(01)00460-3.
Domon C, Lorkovic ZJ, Valcarcel J, Filipowicz W: Multiple forms of the U2 small nuclear ribonucleoprotein auxiliary factor U2AF subunits expressed in higher plants. J Biol Chem. 1998, 273: 34603-34610. 10.1074/jbc.273.51.34603.
Lopato S, Waigmann E, Barta A: Characterization of a novel arginine/serine-rich splicing factor in Arabidopsis. Plant Cell. 1996, 8: 2255-2264. 10.1105/tpc.8.12.2255.
Lopato S, Mayeda A, Krainer AR, Barta A: Pre-mRNA splicing in plants: characterization of Ser/Arg splicing factors. Proc Natl Acad Sci USA. 1996, 93: 3074-3079. 10.1073/pnas.93.7.3074.
Lopato S, Forstner C, Kalyna M, Hilscher J, Langhammer U, Indrapichate K, Lorkovic ZJ, Barta A: Network of interactions of a novel plant-specific Arg/Ser-rich protein, atRSZ33, with atSC35-like splicing factors. J Biol Chem. 2002, 277: 39989-39998. 10.1074/jbc.M206455200.
Golovkin M, Reddy AS: The plant U1 small nuclear ribonucleoprotein particle 70K protein interacts with two novel serine/arginine-rich proteins. Plant Cell. 1998, 10: 1637-1648. 10.1105/tpc.10.10.1637.
Golovkin M, Reddy AS: An SC35-like protein and a novel serine/arginine-rich protein interact with Arabidopsis U1-70K protein. J Biol Chem. 1999, 274: 36428-36438. 10.1074/jbc.274.51.36428.
Lazar G, Schaal T, Maniatis T, Goodman HM: Identification of a plant serine-arginine-rich protein similar to the mammalian splicing factor SF2/ASF. Proc Natl Acad Sci USA. 1995, 92: 7672-7676.
Lopato S, Kalyna M, Dorner S, Kobayashi R, Krainer AR, Barta A: atSRp30, one of two SF2/ASF-like proteins from Arabidopsis thaliana, regulates splicing of specific plant genes. Genes Dev. 1999, 13: 987-1001.
Lopato S, Gattoni R, Fabini G, Stevenin J, Barta A: A novel family of plant splicing factors with a Zn knuckle motif: examination of RNA binding and splicing activities. Plant Mol Biol. 1999, 39: 761-773. 10.1023/A:1006129615846.
Lazar G, Goodman HM: The Arabidopsis splicing factor SR1 is regulated by alternative splicing. Plant Mol Biol. 2000, 42: 571-581. 10.1023/A:1006394207479.
Tronchere H, Wang J, Fu XD: A protein related to splicing factor U2AF35 that interacts with U2AF65 and SR proteins in splicing of pre-mRNA. Nature. 1997, 388: 397-400. 10.1038/41137.
Lin CH, Patton JG: Regulation of alternative 3' splice site selection by constitutive splicing factors. RNA. 1995, 1: 234-245.
ASRG SR protein gene structure. [http://www.plantgdb.org/prj/SiP/SRGD/ASRG/Display.php?GID=2.2&Gst=1]
Cowper AE, Caceres JF, Mayeda A, Screaton GR: Serine-arginine (SR) protein-like factors that antagonize authentic SR proteins and regulate alternative splicing. J Biol Chem. 2001, 276: 48908-48914. 10.1074/jbc.M103967200.
Chan SP, Kao DI, Tsai WY, Cheng SC: The Prp19p-associated complex in spliceosome activation. Science. 2003, 302: 279-282. 10.1126/science.1086602.
Yong J, Pellizzoni L, Dreyfuss G: Sequence-specific interaction of U1 snRNA with the SMN complex. EMBO J. 2002, 21: 1188-1196. 10.1093/emboj/21.5.1188.
Bender J, Fink GR: AFC1, a LAMMER kinase from Arabidopsis thaliana, activates STE12-dependent processes in yeast. Proc Natl Acad Sci USA. 1994, 91: 12105-12109.
Savaldi-Goldstein S, Aviv D, Davydov O, Fluhr R: Alternative splicing modulation by a LAMMER kinase impinges on developmental and transcriptome expression. Plant Cell. 2003, 15: 926-938. 10.1105/tpc.011056.
Krecic AM, Swanson MS: hnRNP complexes: composition, structure, and function. Curr Opin Cell Biol. 1999, 11: 363-371. 10.1016/S0955-0674(99)80051-9.
Heintzen C, Melzer S, Fischer R, Kappeler S, Apel K, Staiger D: A light- and temperature-entrained circadian clock controls expression of transcripts encoding nuclear proteins with homology to RNA-binding proteins in meristematic tissue. Plant J. 1994, 5: 799-813. 10.1046/j.1365-313X.1994.5060799.x.
Lambermon MH, Fu Y, Wieczorek Kirk DA, Dupasquier M, Filipowicz W, Lorkovic ZJ: UBA1 and UBA2, two proteins that interact with UBP1, a multifunctional effector of pre-mRNA maturation in plants. Mol Cell Biol. 2002, 22: 4346-4357. 10.1128/MCB.22.12.4346-4357.2002.
Staiger D, Zecca L, Wieczorek Kirk DA, Apel K, Eckstein L: The circadian clock regulated RNA-binding protein AtGRP7 autoregulates its expression by influencing alternative splicing of its own pre-mRNA. Plant J. 2003, 33: 361-371. 10.1046/j.1365-313X.2003.01629.x.
Simpson GG, Dijkwel PP, Quesada V, Henderson I, Dean C: FY is an RNA 3' end-processing factor that interacts with FCA to control the Arabidopsis floral transition. Cell. 2003, 113: 777-787. 10.1016/S0092-8674(03)00425-2.
Macknight R, Bancroft I, Page T, Lister C, Schmidt R, Love K, Westphal L, Murphy G, Sherson S, Cobbett C, Dean C: FCA, a gene controlling flowering time in Arabidopsis, encodes a protein containing RNA-binding domains. Cell. 1997, 89: 737-745. 10.1016/S0092-8674(00)80256-1.
Quesada V, Macknight R, Dean C, Simpson GG: Autoregulation of FCA pre-mRNA processing controls Arabidopsis flowering time. EMBO J. 2003, 22: 3142-3152. 10.1093/emboj/cdg305.
Paillard L, Legagneux V, Osborne HB: A functional deadenylation assay identifies human CUG-BP as a deadenylation factor. Biol Cell. 2003, 95: 107-113. 10.1016/S0248-4900(03)00010-8.
Lambermon MH, Simpson GG, Wieczorek Kirk DA, Hemmings-Mieszczak M, Klahre U, Filipowicz W: UBP1, a novel hnRNP-like protein that functions at multiple steps of higher plant nuclear pre-mRNA maturation. EMBO J. 2000, 19: 1638-1649. 10.1093/emboj/19.7.1638.
Lorkovic ZJ, Wieczorek Kirk DA, Klahre U, Hemmings-Mieszczak M, Filipowicz W: RBP45 and RBP47, two oligouridylate-specific hnRNP-like proteins interacting with poly(A)+ RNA in nuclei of plant cells. RNA. 2000, 6: 1610-1624. 10.1017/S1355838200001163.
Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al: Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000, 290: 2105-2110. 10.1126/science.290.5499.2105.
Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science. 2000, 290: 2114-2117. 10.1126/science.290.5499.2114.
AtGDB BLAST. [http://www.plantgdb.org/cgi-bin/PlantGDB/AtGDB/BRview.pl]
T-DNAexpress: the SIGnAL Arabidopsis gene mapping tool. [http://signal.salk.edu/cgi-bin/tdnaexpress]
MIPS: MATDB snRNAs. [http://mips.gsf.de/cgi-bin/proj/thal/search_type?all/185]
Drosophila mRNA processing factors. [http://www.life.umd.edu/labs/Mount/factors]
TIGR ftp site. [ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/]
Xing L, Brendel V: Multi-query sequence BLAST output examination with MuSeqBox. Bioinformatics. 2001, 17: 744-745. 10.1093/bioinformatics/17.8.744.
AtGDB advanced search. [http://www.plantgdb.org/cgi-bin/PlantGDB/AtGDB/ASview.pl]
Brendel V, Xing L, Zhu W: Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus. Bioinformatics. 2004, 20: 1157-1169. 10.1093/bioinformatics/bth058.
Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17: 847-848. 10.1093/bioinformatics/17.9.847.
NCBI-CDD search. [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml]
Arabidopsis thaliana: MATDB Redundancy Viewer. [http://mips.gsf.de/proj/thal/db/gv/rv/rv_frame.html]
Zhu W, Schlueter SD, Brendel V: Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping. Plant Physiol. 2003, 132: 469-484. 10.1104/pp.102.018101.
Hirayama T, Shinozaki K: A cdc5+ homolog of a higher plant, Arabidopsis thaliana. Proc Natl Acad Sci USA. 1996, 93: 13371-13376. 10.1073/pnas.93.23.13371.
Landsberger M, Lorkovic ZJ, Oelmuller R: Molecular characterization of nucleus-localized RNA-binding proteins from higher plants. Plant Mol Biol. 2002, 48: 413-421. 10.1023/A:1014089531125.
We thank Shannon Schlueter for help with the web page and database design and implementation. We are also grateful to Shailesh Lal, Carolyn Lawrence and Michael Sparks for discussions and critical reading of the manuscript and to the anonymous reviewers for excellent suggestions. This work was supported in part by a grant from the ISU Plant Sciences Institute and NSF grants DBI-0110189 and DBI-0110254 to V.B.
Wang, BB., Brendel, V. The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing. Genome Biol 5, R102 (2004). https://doi.org/10.1186/gb-2004-5-12-r102
- Splice Factor
- Intron Retention
- Splice Regulator
- Splice Mechanism
- Alternative Splice Pattern