Concerted gene recruitment in early plant evolution

Analyses of the red algal Cyanidioschyzon genome identified 37 genes that were acquired from non-organellar sources prior to the split of red algae and green plants.

Data sources: Protein sequences for the red alga Cyanidioschyzon merolae were obtained from the Cyanidioschyzon Genome Project [41,73]. EST sequences of several protists were obtained from TBestDB [74]. All other sequences were from the NCBI protein sequence database.
Identification of ancient HGT: Anciently acquired genes in this study are those horizontally acquired prior to the split of red algae and green plants. A list of ancient HGT candidates was first identified by phylogenomic screening of the Cyanidioschyzon genome using PhyloGenie and the NCBI non-redundant protein sequence database. The vast majority of the genes on this list are found predominantly in bacteria and archaea, and are therefore likely of prokaryotic origin. To reduce the complications arising from potential cases of intracellular gene transfer (IGT) from organelles, we adopted an approach combining sequence comparison, phylogenetic analyses, and statistical tests. Each gene on the list was first used to search the NCBI protein sequence database. Because of the cyanobacterial origin of plastids and the α-proteobacterial origin of mitochondria, genes with cyanobacterial and plastid-containing eukaryotic homologs as top hits were considered likely plastid-derived; those with proteobacterial and other eukaryotic homologs were considered likely mitochondrion-derived. These potentially organelle-derived genes were removed from the candidate list and the remaining genes were subjected to detailed phylogenetic analyses (see below). Gene tree topologies generated through detailed phylogenetic analyses were carefully inspected; any genes that formed a monophyletic group with cyanobacterial homologs or with proteobacterial and other eukaryotic sequences were also eliminated from further consideration. Additionally, alternative topologies representing various evolutionary scenarios for each gene were statistically evaluated using AU tests [43] (see below); genes for which a straightforward IGT scenario (versus IGT followed by secondary transfers) could not be rejected (p-value > 0.05) were also removed from the HGT candidate list.
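The top-hit screen described above can be sketched as a simple classification rule. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Hit` record, the taxon-group labels, and the function `classify_by_top_hits` are hypothetical names, and the rule encodes one simplified reading of the text (both groups named in each criterion must be present among the top hits).

```python
from dataclasses import dataclass

# Hypothetical taxon-group labels, used only for this illustration.
CYANO = "cyanobacteria"
PLASTID_EUK = "plastid_containing_eukaryote"
PROTEO = "proteobacteria"
OTHER_EUK = "other_eukaryote"


@dataclass(frozen=True)
class Hit:
    """One database hit, reduced to the taxonomic group of its source."""
    accession: str
    group: str


def classify_by_top_hits(top_hits):
    """Label a gene by the taxonomic make-up of its top hits.

    Returns "plastid-derived", "mitochondrion-derived", or "candidate".
    Assumption: a criterion is met only when the top hits consist of
    exactly the two groups that the criterion names.
    """
    groups = {hit.group for hit in top_hits}
    if groups == {CYANO, PLASTID_EUK}:
        return "plastid-derived"
    if groups == {PROTEO, OTHER_EUK}:
        return "mitochondrion-derived"
    # Everything else stays on the list for detailed phylogenetic analysis.
    return "candidate"


if __name__ == "__main__":
    example = [Hit("hit1", PROTEO), Hit("hit2", PROTEO)]
    print(classify_by_top_hits(example))
```

Genes labeled "candidate" here would still have to pass the tree inspection and AU tests described in the text; the sketch covers only the first screening step.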
Detailed phylogenetic analyses: Sequences were sampled from representative groups (including major phyla of bacteria and major groups of eukaryotes) within each domain of life (bacteria, archaea, and eukaryotes). Because of the potential for sequence contamination, eukaryotic EST sequences of doubtful authenticity (e.g., high nucleotide sequence percent identity with bacterial homologs and/or absence of homologs in genomes of closely related taxa) were not included in the analyses. Multiple protein sequence alignments were performed using MUSCLE [77] and ClustalX [78], followed by cross-comparisons and manual refinement. Only unambiguously aligned sequence portions were used. Phylogenetic analyses were performed with a maximum likelihood method using PHYML [79], a Bayesian inference method using MrBayes [80], and a distance method using the program neighbor of PHYLIP version 3.65 [81] with maximum likelihood distances calculated using TREE-PUZZLE [82]. All maximum likelihood calculations were based on a substitution matrix determined using ProtTest [83] and a mixed model of 4 gamma-distributed rate classes plus invariable sites.
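The restriction to unambiguously aligned portions was done by manual refinement in this study. As a rough automated stand-in only, the sketch below drops alignment columns whose gap fraction exceeds a threshold; the function name `drop_gappy_columns` and the `max_gap_fraction` parameter are hypothetical, and gap content is merely a proxy for alignment ambiguity.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.0):
    """Remove alignment columns with too many gaps.

    alignment: dict mapping sequence name -> aligned sequence string,
               all of equal length, with '-' as the gap character.
    max_gap_fraction: columns with a larger share of gaps are dropped
                      (0.0 keeps only gap-free columns).
    """
    sequences = list(alignment.values())
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    # Indices of the columns that pass the gap filter.
    kept = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in sequences) / len(sequences) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"seq_a": "MK-LV", "seq_b": "MKALV", "seq_c": "MR-LV"}
    print(drop_gappy_columns(toy))
```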
Maximum likelihood distances for bootstrap analyses were calculated using TREE-PUZZLE and PUZZLEBOOT v1.03 (by Michael E. Holder and Andrew J. Roger, available on the web [84]). Branch lengths and topologies of the trees depicted in all figures were calculated with PHYML. For the convenience of presentation, gene trees were rooted using archaeal (or archaeal + eukaryotic) sequences, or paralogous gene copies if ancient gene families were involved, as outgroups; otherwise, trees were rooted in a way that no top hits of the sequence similarity search were used as an outgroup. Nevertheless, all gene trees should be strictly interpreted as unrooted.
AU tests on alternative tree topologies: Following detailed phylogenetic analyses, alternative tree topologies for each remaining HGT candidate were assessed for their statistical confidence using Treefinder [85]. In most cases, multiple constraint trees were generated using Treefinder for each HGT candidate by enforcing a) monophyly of all eukaryotic sequences, b) monophyly of cyanobacterial, plant and other plastid-containing eukaryotic sequences, and c) monophyly of cyanobacterial, plant, and closely related bacterial sequences. These alternative topologies assumed that the subject gene in plants is not HGT-derived; they served as null hypotheses that all eukaryotic sequences have the same eukaryotic or mitochondrial origin or that plants acquired the subject gene from plastids, sometimes followed by secondary HGT to other bacterial groups. AU tests, which have been recommended for general tree tests, were performed on alternative tree topologies (non-HGT hypotheses) and the tree generated from detailed phylogenetic analyses (HGT hypothesis). In this study, topologies with a p-value < 0.05 were rejected.
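The decision rule stated above can be written down compactly. The sketch below assumes the AU-test p-values have already been computed elsewhere (for example by Treefinder) and only applies the 0.05 threshold; the function names and the dictionary layout are hypothetical, and a reported "p < 0.001" is represented by its upper bound 0.001.

```python
ALPHA = 0.05  # rejection threshold stated in the text


def is_rejected(p_value, alpha=ALPHA):
    """A topology is rejected when its AU-test p-value is below alpha."""
    return p_value < alpha


def hgt_candidate_retained(alternative_p_values, alpha=ALPHA):
    """Keep an HGT candidate only if every non-HGT topology is rejected.

    alternative_p_values: dict mapping a constraint-topology label
                          (e.g. "A", "B") to its AU-test p-value.
    """
    return all(is_rejected(p, alpha) for p in alternative_p_values.values())


if __name__ == "__main__":
    # Hypothetical input: topology A reported as p < 0.001, topology B as 0.041.
    print(hgt_candidate_retained({"A": 0.001, "B": 0.041}))
```

A single alternative topology that cannot be rejected is enough to drop the gene from the candidate list, which is why the check uses `all`.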
Prediction of protein localization: Targeting signals of the identified protein sequences were predicted using ChloroP [86] and TargetP [87]. Additional information about protein localization in green plants was obtained from The Arabidopsis Information Resource (TAIR).

Protein sequence alignment used for phylogenetic analyses and resulting phylogenetic trees.
Each sequence name includes a GI number from GenBank (or ID number from other databases) followed by species name. Numbers above the branches of the gene tree show bootstrap support values for maximum-likelihood analyses and distance analyses, and posterior probability from Bayesian analyses, respectively. Asterisks indicate support values below 50%. N denotes genes whose homologs are rarely found in cyanobacteria and that likely possessed novel functions; E denotes genes for which plastid-derived homologs already exist in plants; D denotes genes for which a possible replacement of an endogenous homolog cannot be excluded.
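The branch-label convention in this legend can be illustrated with a short formatter. This is a sketch under assumptions not stated in the source: all three support values are taken to be on a percentage scale, and the "/" separator and the function name `format_support` are hypothetical.

```python
def format_support(ml_bootstrap, distance_bootstrap, bayes_posterior, cutoff=50.0):
    """Build a branch label from three support values (all in percent).

    Order follows the legend: maximum likelihood, distance, Bayesian.
    Values below the cutoff are shown as an asterisk.
    """
    def show(value):
        return "*" if value < cutoff else str(round(value))

    return "/".join(show(v) for v in (ml_bootstrap, distance_bootstrap, bayes_posterior))


if __name__ == "__main__":
    print(format_support(98, 45, 100))
```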
Note: All top hits of GenBank searches (using the Cyanidioschyzon sequence and Arabidopsis accession number NP_974701 as queries) are from gamma- and beta-proteobacteria. Multiple copies of this gene exist in plants and in bacteria. One of these copies in red algae and green plants forms a group with homologs from gamma- and beta-proteobacteria with strong support, whereas the other copy groups with cyanobacterial sequences with modest support (the Cyanidioschyzon sequence in that group is encoded in the plastid genome). In Arabidopsis, the protein product of the gamma- and beta-proteobacteria-related gene copy is annotated as located in the cytoplasm (GenBank accession number NP_974701 and TAIR locus AT4G37670). The EST sequence of the glaucophyte Cyanophora was obtained from TBestDB; this sequence groups with homologs of red algae and green plants in preliminary analyses, but was removed from the detailed phylogenetic analyses because of its very short length (only 32 aa).
Figure 1. Molecular phylogeny of GCN5-related N-acetyltransferase. P-value = 0.235 from AU test on the presented tree. AU tests were also performed on alternative topologies including (A) monophyly of all red algal and green plant sequences, and (B) monophyly of cyanobacterial and all red algal and green plant sequences. These alternative topologies investigate if both copies of this gene in red algae and green plants have the same plastidic origin. P-values < 0.001 from AU tests on both alternative topologies.

Glycyl-tRNA synthetase (D)
Note: GlyRS exists in two forms, a homodimer (α2) and a heterotetramer (α2β2); the former is distributed in eukaryotes, archaea, and many bacteria, whereas the latter is found only in bacteria, red algae and green plants. Few bacteria contain both GlyRS types. The α- and β-subunits of tetrameric GlyRS are usually encoded by separate genes. In a few groups, the two subunits are encoded by a single fused gene; these include actinomycetes, chlamydiae, red algae, and green plants. Not only are sequences from actinomycetes, red algae and green plants similar in gene structure, they also have the highest percent identity and share several conserved amino acid residues. Phylogenetic analyses of each of the subunits strongly suggest a common origin of actinomycete, red algal and plant sequences (Figures 2A-2B). Based on the gene structure and molecular phylogeny, it is likely that primary photosynthetic eukaryotes acquired this gene from either actinomycetes or chlamydiae (aside from actinomycetes, chlamydiae are the only bacterial group that possesses a fused gene in our database searches). The second scenario requires an independent HGT event from photosynthetic eukaryotes or chlamydiae to actinomycetes. The Arabidopsis sequence (GenBank accession number NP_190394, TAIR locus AT3G48110) has been experimentally determined to be targeted to both chloroplasts and mitochondria.
[Glycyl-tRNA synthetase alignment excerpt; sequence rows not reproduced.]

ThiC family protein (D)
CLUSTAL X (1.83.1) multiple sequence alignment [alignment rows not reproduced]
Note: All top hits in GenBank searches (using the Cyanidioschyzon sequence and Arabidopsis GI 22136156 as queries) are from proteobacteria, firmicutes and spirochaetes. Primary photosynthetic eukaryotic sequences share many conserved residues with non-cyanobacterial sequences, and are likely not of cyanobacterial origin. This is also supported by the phylogenetic analyses. The Arabidopsis sequence (GenBank accession number NP_180524 and TAIR locus AT2G29630) is annotated as a chloroplast precursor. The glaucophyte Cyanophora sequence was obtained from TBestDB.
[...]539 from AU test for the presented tree. AU tests were also performed on alternative topologies, including (A) monophyly of red algal, green plant and cyanobacterial sequences, and (B) monophyly of red algal, green plant and archaeal sequences. These tests investigate if red algal and green plant sequences have a plastidic or an archaeal (or eukaryotic) origin. P-values < 0.001 from AU tests for both alternative topologies.
[Alignment rows not reproduced.]
Note: All top hits in GenBank searches (using the Cyanidioschyzon sequence and Arabidopsis GI 15231844 as queries) are from various non-cyanobacterial groups. The donor of the acquired gene in primary photosynthetic eukaryotes (upper part of the tree) is difficult to pinpoint because of the lack of sufficient internal support on the gene tree, but it is unlikely to be cyanobacterial based on AU tests. The Arabidopsis sequence (GenBank GI 15231844 and TAIR locus AT3G14390) is annotated as a chloroplast precursor.
[...] Ostreococcus sequence that appears to be distant). These tests investigate if red algal, green plant and glaucophyte sequences have the same origin (mitochondrial or eukaryotic) as other eukaryotic sequences and if red algae, green plants and glaucophytes acquired this gene from plastids. P-values < 0.001 from AU tests for both alternative topologies.

MGDG synthase (N)
[Alignment rows not reproduced.]
Figure 6. Molecular phylogeny of MGDG synthase. See text for detailed explanation. P-value = 0.235 from AU test on the presented tree. An AU test was also performed on an alternative topology enforcing monophyly of photosynthetic eukaryotic sequences with the rare cyanobacterial homolog (i.e., Gloeobacter in this case). P-value < 0.001 from AU test for the alternative topology.
[Alignment rows not reproduced.]
Note: All top hits in GenBank searches (using Cyanidioschyzon and Arabidopsis sequences as queries) are from gamma-proteobacteria. Phylogenetic analyses show that red algal and green plant sequences group with beta- and gamma-proteobacterial homologs with strong support. The Arabidopsis sequence (GenBank accession number AAM98284 and TAIR locus AT5G66120) is annotated as a chloroplast precursor. Sequences of Euglena and Karlodinium were obtained from TBestDB.
Figure 8. Molecular phylogeny of 3-dehydroquinate synthase. P-value = 0.997 from AU test on the presented tree. AU tests were also performed on alternative topologies, including (A) monophyly of all eukaryotic sequences (except for Karlodinium, which appears to be of cyanobacterial origin), and (B) monophyly of cyanobacterial, green plant and red algal sequences. These tests investigate if red algal and green plant sequences have the same origin (mitochondrial or eukaryotic) as other eukaryotic sequences and if they acquired the gene from plastids. P-values < 0.001 from AU tests for both alternative topologies.

9-10. 2-methylthioadenine synthetase (E, D)
CLUSTAL X (1.83.1) multiple sequence alignment [alignment rows not reproduced]
Figure 9. Molecular phylogeny of 2-methylthioadenine synthetase. P-value = 0.983 from AU test for the presented tree. See text for a more detailed discussion. AU tests were also performed on alternative topologies, including (A) miaB1 and miaB2 forming a monophyletic group, (B) miaB1 and miaB2 forming a monophyletic group that in turn groups with archaeal sequences, and (C) miaB2 forming a monophyletic group with proteobacterial sequences from the top part of the tree. These tests investigate if (a) miaB1 and miaB2 have the same origin, (b) miaB1 and miaB2 have a eukaryotic origin and are related to archaeal homologs, and (c) miaB2 has a mitochondrial origin. P-values < 0.001 from AU tests for these alternative topologies.
Figure 11. Molecular phylogeny of uroporphyrinogen-III synthase. P-value = 0.959 from the AU test on the presented tree. AU tests were also performed on alternative topologies including (A) monophyly of red algal, green plant and cyanobacterial sequences, and (B) red algal, green plant and Deinococcus sequences forming a monophyletic group that in turn groups with cyanobacterial homologs. These tests investigate if (a) red algae and green plants acquired this gene from plastids, and (b) red algae and green plants acquired the gene from plastids and the gene subsequently spread to Deinococcus by secondary HGT. P-value < 0.001 from AU test for topology A and p-value = 0.04 for topology B.

ACT-domain containing protein (N)
CLUSTAL X (1.83.1) multiple sequence alignment

Queuine tRNA-ribosyltransferase (D)
CLUSTAL X (1.83.1) multiple sequence alignment [alignment rows not reproduced]
Note: This is an intriguing case of ancient HGT. The Dictyostelium sequence forms a group with homologs of green and red algae as well as chlamydiae. This specific affiliation of Dictyostelium and plant sequences has been observed in multiple cases (Huang, unpublished data), likely resulting from plant-Dictyostelium transfer. A plausible explanation is that primary photosynthetic eukaryotes acquired this gene from chlamydiae and the gene then spread further to Dictyostelium via secondary gene transfer. The Heterocapsa sequence is a chloroplast precursor according to the original GenBank annotation. Heterocapsa and Wolbachia sequences also share indels and many conserved residues. The remaining eukaryotic sequences are much more similar to bacterial than to archaeal homologs, and they are likely of bacterial origin. One possible explanation is that they are derived from mitochondria. Nevertheless, most of these eukaryotic sequences lack an N-terminal extension. Sequences of Trimastix and Jakoba were obtained from TBestDB.
Figure 13. Molecular phylogeny of queuine tRNA-ribosyltransferase. P-value = 0.369 from the AU test for the presented gene tree. AU tests were also performed on alternative topologies including (A) monophyly of all eukaryotic sequences, and (B) monophyly of red algal, green plant, Dictyostelium, and cyanobacterial sequences. These tests investigate if all eukaryotic sequences have the same origin (mitochondrial or eukaryotic) and if red algal and green plant sequences are likely derived from plastids. P-values < 0.001 from AU tests for both alternative topologies.
Note: This gene has identifiable homologs only in prokaryotes and plastid-containing eukaryotes (red algae, green plants, and the apicomplexan Plasmodium) in our GenBank and TBestDB searches (using Cyanidioschyzon and Arabidopsis sequences as queries). Homologs of the gene are rarely found in cyanobacteria. All top hits are from gamma- and beta-proteobacteria.
Figure 14. Molecular phylogeny of SAM-dependent methyltransferase. P-value = 0.921 from AU test for the presented gene tree. AU tests were also performed on alternative topologies including (A) monophyly of red algal, green plant and Plasmodium sequences, (B) monophyly of red algal, green plant, Plasmodium and cyanobacterial sequences, and (C) monophyly of red algal, green plant, Plasmodium, and archaeal sequences. Topology A investigates if red algal, green plant, and Plasmodium (which also has a relict plastid) sequences have the same origin. Topology B investigates if the three plastid-containing groups acquired the gene from plastids. Topology C investigates if these plastid-containing eukaryotic sequences have an archaeal or eukaryotic origin. P-value = 0.082 from AU test for topology A whereas p-values < 0.001 for topologies B and C.

Semialdehyde dehydrogenase (D)
CLUSTAL X (1.83.1) multiple sequence alignment
Note: This gene is found in green plants, red algae, fungi, Capsaspora and prokaryotes. All top hits in GenBank searches are from alpha-proteobacteria. Phylogenetic analyses also support an alpha-proteobacterial origin of the gene in red algae and green plants. Fungal and Capsaspora sequences clearly differ from red algal, green plant and prokaryotic sequences (they have an N-terminal extension of about 500 aa and share many conserved residues with each other). The Neurospora sequence (GenBank accession number P54898) has been experimentally determined to be a mitochondrial precursor. Red algal and green plant sequences share many conserved residues with alpha-proteobacterial homologs. The Arabidopsis sequences (GenBank accession numbers and TAIR loci NP_565461, AT2G19940 and NP_849993, AT2G19940 respectively) are annotated as located in the cytoplasm. The Capsaspora sequence was obtained from TBestDB.
Figure 15. Molecular phylogeny of semialdehyde dehydrogenase. P-value = 0.981 from AU test for the presented tree. AU tests were also performed on alternative topologies including (A) monophyly of all eukaryotic sequences, and (B) monophyly of cyanobacterial, red algal, green plant, and alpha-proteobacterial sequences. Topology A investigates if all eukaryotic sequences have the same origin (mitochondrial or eukaryotic), whereas topology B investigates if red algae and green plants acquired this gene from plastids and the gene then spread to alpha-proteobacteria. P-value = 0.045 from AU test for topology A and p-value < 0.001 for topology B.

Note: The identifiable homologs of this gene (using Cyanidioschyzon and Arabidopsis sequences as queries) are found only in bacteria and eukaryotes, with all top hits being from beta- and gamma-proteobacteria. This disjunct distribution suggests that the eukaryotic sequences are likely of bacterial origin. Phylogenetic analyses show a common ancestry of sequences from red algae, green plants, and gamma- and beta-proteobacteria. The gene was likely transferred independently to two groups of eukaryotes, once to the ancestor of red algae and green plants and once to the bacteriotrophic Reclinomonas. Reclinomonas and Aedes share several conserved residues, and their grouping together likely resulted from eukaryote-to-eukaryote gene transfer. The Arabidopsis gene product is localized in the cytoplasm according to GenBank annotation. The Reclinomonas sequence was obtained from TBestDB.
Figure 18. Molecular phylogeny of ribosomal protein L11 methyltransferase. P-value = 0.959 from AU test for the presented gene tree. AU tests were also performed on alternative topologies including (A) monophyly of all eukaryotic sequences, and (B) monophyly of red algal, green plant, and cyanobacterial sequences. P-value < 0.001 from AU test for topology A and p-value = 0.041 for topology B.

tRNA methyltransferase (D)
Note: This gene appears to be restricted to bacteria and eukaryotes. All top hits of GenBank searches are from Borrelia, Lentisphaeria, and CFB bacteria. The protein product of the Arabidopsis sequence (GenBank accession number NP_175542 and TAIR locus AT1G51310) is localized in both chloroplasts and cytoplasm. The major eukaryotic sequence group contains some mitochondrial precursors (e.g., the Homo sequence) and is likely of mitochondrial origin, although it is not particularly related to alpha-proteobacterial homologs. Note also that the green alga Ostreococcus contains two versions of this gene, one of which groups with the eukaryotic mitochondrial clade while the other groups with sequences of red algae, Theileria, CFB bacteria and spirochaetes. Sequences of Reclinomonas and Hartmannella were obtained from TBestDB; these two sequences formed a group with Homo, Dictyostelium, and other eukaryotic sequences in preliminary phylogenetic analyses, but were excluded from detailed phylogenetic analyses because of their short length. Theileria is an apicomplexan parasite containing a plastid derived from an algal endosymbiont.
Figure 21. Molecular phylogeny of tRNA methyltransferase. P-value = 0.977 from AU test for the presented tree. AU tests were also performed on alternative topologies including (A) monophyly of all eukaryotic sequences, and (B) monophyly of cyanobacterial, red algal, and green plant sequences. These alternative topologies investigate if red algae and green plants acquired the gene from mitochondria or plastids, respectively. P-values < 0.001 from AU tests for both alternative topologies.
[Alignment rows not reproduced.]
Note: There are two eukaryotic sequence clades for this gene, each of which clusters within bacterial homologs with strong support. The minor eukaryotic sequence clade (lower part of the tree) contains chloroplast precursors from plants and mitochondrial precursors from opisthokonts. Sequences in the major eukaryotic sequence clade are cytosolic. It is likely that the major eukaryotic sequence clade resulted from an ancient HGT event prior to the split of most eukaryotic supergroups. An alternative explanation is that the common ancestor of cellular organisms contained two copies of this gene, which were differentially retained among lineages.
Figure 22. Molecular phylogeny of isoleucyl-tRNA synthetase. P-value = 0.235 from AU test for the presented tree. An AU test was also performed on an alternative topology enforcing monophyly of archaeal sequences and the major eukaryotic group. This alternative topology is based on the common belief that archaea and eukaryotes are more closely related to each other than either is to bacteria. P-value < 0.001 from AU test for the alternative topology.