Low frequency of paleoviral infiltration across the avian phylogeny

Background Mammalian genomes commonly harbor endogenous viral elements. Due to a lack of comparable genome-scale sequence data, far less is known about endogenous viral elements in avian species, even though their small genomes may enable important insights into the patterns and processes of endogenous viral element evolution. Results Through a systematic screening of the genomes of 48 species sampled across the avian phylogeny we reveal that birds harbor a limited number of endogenous viral elements compared to mammals, with only five viral families observed: Retroviridae, Hepadnaviridae, Bornaviridae, Circoviridae, and Parvoviridae. All nonretroviral endogenous viral elements are present at low copy numbers and in few species, with only endogenous hepadnaviruses widely distributed, although these have been purged in some cases. We also provide the first evidence for endogenous bornaviruses and circoviruses in avian genomes, although at very low copy numbers. A comparative analysis of vertebrate genomes revealed a simple linear relationship between endogenous viral element abundance and host genome size, such that the occurrence of endogenous viral elements in bird genomes is 6- to 13-fold less frequent than in mammals. Conclusions These results reveal that avian genomes harbor relatively small numbers of endogenous viruses, particularly those derived from RNA viruses, and hence are either less susceptible to viral invasions or purge them more effectively. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0539-3) contains supplementary material, which is available to authorized users.


Background
Vertebrate genomes commonly harbor retrovirus-like [1] and non-retrovirus-like [2] viral sequences, resulting from past chromosomal integration of viral DNA (or DNA copies of viral RNA) into host germ cells. Tracing the evolutionary histories of these endogenous viral elements (EVEs) can provide important information on the origin of their extant counterparts, and provide an insight into host genome dynamics [3][4][5][6][7]. Recent studies have shown that these genomic 'fossils' can also influence the biology of their hosts, both beneficially and detrimentally; for example, by introducing novel genomic rearrangements, influencing host gene expression, as well as evolving into new protein-coding genes with cellular functions (that is, 'gene domestication') [4,6].
Because integration into host genomes is intrinsic to the replication cycle of retroviruses which employ reverse transcriptase (RT), it is no surprise that retroviruses are commonly found to have endogenous forms in a wide range of animal genomes [8]. Indeed, most of the EVEs present in animal genomes are of retroviral origin -endogenous retroviruses (ERVs) -and EVEs representing all retroviral genera, with the exception of Deltaretrovirus, have been found to possess endogenous forms. Remarkably, recent studies have revealed the unexpected occurrence of non-retroviral elements in various animal genomes, including RNA viruses that lack a DNA form in their replication cycle [2,6]. Since their initial discovery, EVEs in animal genomes have been documented for families of double-stranded (ds)DNA viruses (virus classification Group I) -Herpesviridae; single-stranded (ss)DNA viruses (Group II) -Circoviridae and Parvoviridae; ssRNA viruses (Group IV) -Bornaviridae and Filoviridae; ssRNA-RT viruses (Group VI) -Retroviridae; and dsDNA-RT viruses (Group VII) -Hepadnaviridae [6].
To date, most studies of animal EVEs have focused on mammals due to their relatively high density of sampling.
In contrast, few studies on the EVEs present in avian species have been undertaken. The best-documented avian EVEs are endogenous hepadnaviruses. These virally derived elements were first described in the genome of a passerine bird -the zebra finch [9] -and then in the genome of the budgerigar [10] as well as some other passerines [11], and may have a Mesozoic origin in some cases [11]. Also of note was the discovery of a great diversity of ERVs in the genomes of zebra finch, chicken and turkey, most of which remain transcriptionally active [12]. In contrast, most mammalian ERVs are inert.
In this study, we systematically mined 48 avian genomes for EVEs of all viral families, as one of a body of companion studies on avian genomics [13,14]. Importantly, our data set represents all 32 neognath and two of the five palaeognath orders, and thus represents nearly all major orders of extant birds. Such a large-scale data analysis enabled us to address a number of key questions in EVE evolution, namely (i) what types of viruses have left such genomic fossils across the avian phylogeny and in what frequencies, (ii) what are the respective frequencies of EVE inheritance between species and independent species-specific insertion, and (iii) what is the frequency and pattern of avian EVE infiltration compared with other vertebrates?

Genome scanning for avian endogenous viral elements
Our in silico genomic mining of the 48 avian genomes [13,14] (Table S1 in Additional file 1) revealed the presence of five families of endogenous viruses -Retroviridae, Hepadnaviridae, Circoviridae, Parvoviridae, and Bornaviridae ( Figure 1), almost all of which (>99.99%) were of retroviral origin. Only a single family of RNA viruses (Group IV; the Bornaviridae) was present. Notably, three closely related oscine passerine birds -the American crow, medium ground-finch and zebra finch -possessed greater ERV copy numbers in their genomes than the avian average (Table 1; discussed in detail below), while their suboscine passerine relatives -rifleman and golden-collared manakinpossessed lower ERV numbers close to the avian average (Table 1) and occupied basal positions in the passerine phylogeny ( Figure 1). Hence, there appears to have been an expansion of ERVs coincident with the species radiation of the suborder Passeri.
We next consider each of the EVE families in turn.

Endogenous viral elements related to the Retroviridae
As expected, ERVs were by far the most abundant EVE class in the avian genomes, covering the genera Alpha-, Beta-, Gamma-, and Epsilonretrovirus, with total ERV copy numbers ranging from 132 to 1,032. The greatest numbers of ERVs were recorded in the three oscine passerines (American crow, medium ground-finch and zebra finch, respectively) that exhibited EVE expansion (Table 1). ERVs related to beta-and gammaretroviruses were the most abundant in all avian genomes as noted in an important earlier study of three avian genomes [12]. In contrast, ERVs derived from epsilonretroviruses were extremely rare, with very few copies distributed (Additional file 2). We also found that ERVs related to alpharetroviruses were widely distributed in avian phylogeny, although with very low copy numbers [12]. In accord with the overall genetic pattern among the EVEs, the three oscine passerines exhibited greater numbers of ERVs than other taxa (two-to three-fold higher than the average; Table 1). This suggests that an ERV expansion occurred in the oscine passerines subsequent to their split from the suboscines. Phylogenetic analysis revealed that this pattern was due to frequent invasions of similar beta-and gammaretroviruses in these species (Table 1; Additional file 2). Strikingly, the avian and non-avian (American alligator, green turtle and anole lizard) genomes seldom shared orthologous sequences (that is, only a few avian sequences can be aligned with those of non-avians and without matching flanking regions) and all their ERVs were distantly related (Additional file 2), indicative of a lack of vertical or horizontal transmission among these vertebrates. In addition, no non-retroviral elements were found in the non-avian genomes using our strict mining pipeline.

Endogenous viral elements related to the Hepadnaviridae
Hepadnaviruses have very small genomes (approximately 3 kb) of partially double-stranded and partially singlestranded circular DNA. Their replication involves an RNA intermediate that is reverse transcribed in the cytoplasm and transported as cDNA back into the nucleus. Strikingly, we found endogenous hepadnaviral elements in all the avian genomes studied (Table S2 in Additional file 1), such that they were the most widely distributed non-retroviral EVEs recorded to date. In this context it is important to note that no mammalian endogenous hepadnaviruses have been described even though primates are major reservoirs for exogenous hepatitis B viruses [15].
Our phylogenetic analysis revealed a number of notable evolutionary patterns in the avian endogenous hepadnaviruses: (i) endogenous hepadnaviruses exhibited a far greater phylogenetic diversity, depicted as diverse clades, than their exogenous relatives (Additional file 3), suggesting they were older, although an acceleration in evolutionary rates among some hepadnaviral EVEs cannot be excluded; (ii) exogenous hepadnaviruses formed a tight monophyletic group compared with the endogenous elements (Additional file 3), indicative of a turnover of exogenous viruses during avian evolution; (iii) there was a marked difference in copy number (from 1 to 68) among avian species (Table S2 in Additional file 1), suggestive of the frequent gain and loss of viruses during avian evolution; and (iv) there was a phylogeny-wide incongruence between the virus tree (Additional file 3) and the host tree (P = 0.233 using ParaFit method), indicative of multiple independent genomic integration events as well as potential cross-species transmission events.
Despite the evidence for independent integration events, it was also clear that some hepadnavirus EVEs were inherited from a common ancestor of related avian groups, and perhaps over deep evolutionary time-scales. We documented these cases by looking for pairs of endogenous hepadnaviruses from different avian hosts that received strong (>70%) bootstrap support (Data S1 in Additional file 4) and which occupied orthologous locations. Specifically: (i) in the genomes of the white-tailed and bald eagles, the 5′ end of an hepadnavirus EVE was flanked by a same unknown gene while the 3′ end was flanked by the dendritic cell immunoreceptor (DCIR) gene (Additional file 3); (ii) an EVE shared by the emperor penguin and Adelie penguin (Additional file 3) was flanked by a same unknown gene at the 5′ end and the Krueppel-like factor 8-like gene at the 3′ end; and (iii) the ostrich and the great tinamou had the same  flanking genes, albeit of unknown function, at both ends of an EVE. We also recorded a rare case of vertical transmission of a hepadnavirus with a complete genome that has seemingly been inherited by 31 species (Table S2 in Additional file 1) prior to the diversification of the Neoaves 73 million years ago [14]. This virus has been previously denoted as eZHBV_C [11], and was flanked by the furry homolog (FRY) gene at both the 5′ and 3′ ends. Our hepadnavirus phylogeny ( Figure 2) showed that this EVE group clustered tightly with extremely short internal branches, although with some topological patterns that were inconsistent with the host topology ( Figure 1). A lack of phylogenetic resolution notwithstanding, this mismatch between the virus and host trees could be also in part be due to incomplete lineage sorting, in which there has been insufficient time for allele fixation during the short time period between bird speciation events. Indeed, Neoaves are characterized by a rapid species radiation [16].
Strikingly, we observed that two Galliformes species, chicken and turkey, have seemingly purged their hepadnaviral EVEs. Specifically, genomic mining revealed no hepadnaviral elements in these galliformes, even though their closest relatives (Anseriformes) contained such elements. In support of this genome purging, we noted that one hepadnaviral element present in the mallard genome has been severely degraded through frequent mutation in the chicken genome (Additional file 5). In addition, remnants of orthologous 5′ and 3′ regions could also be found in the turkey genome, although the rest of the element was deleted (Additional file 5).

Endogenous viral elements related to the Bornaviridae
Bornaviruses (family Bornaviridae) are linear, unsegmented negative-sense ssRNA viruses with genomes of approximately 9 kb. They are unusual among animal RNA viruses in their ability to replicate within the host cell nucleus, which in turn assists endogenization. Indeed, orthomyxoviruses and some insect rhabdoviruses also replicate in the nucleus and both have been found to occur as endogenous forms in insect genomes [2]. Endogenous elements of bornaviruses, denoted endogenous bornavirus-like N (EBLN) [2,17,18] and endogenous bornavirus-like L (EBLL) [2,18], have been discovered in mammalian genomes, including humans, and those present in primates have been dated to have arisen more than 40 million years ago [17,18]. Although exogenous bornaviruses circulate in both mammals and birds and cause fatal diseases [19,20], endogenous bornaviruses have not yet been documented in avian species.
We report, for the first time, that both EBLN and EBLL are present in several avian genomes (Additional file 6), although in only three species and with very low copy numbers (1 to 4; Table S3 in Additional file 1): the Anna's hummingbird, the closely related chimney swift, and the more distantly related woodpecker. Both EBLN and EBLL in the genome of Anna's hummingbird were divergent compared with other avian or mammalian viruses. The chimney swift possessed a copy of EBLN, which was robustly grouped in the phylogenetic tree with the EVE present in Anna's hummingbird ( Figure  S4A in Additional file 6). However, as these viral copies did not share the same flanking regions in the host genomes, as well as the inconsistent phylogenetic positions of the EBLN ( Figure S4A in Additional file 6) and EBLL ( Figure S4C in Additional file 6) of Anna's hummingbird, they likely represent independent integration events. In addition, due to the close relationships among some of the viruses in different species, it is possible that crossspecies transmission has occurred because of shared geographical distributions (for example, woodpeckers are widely distributed across the United States, with geographic distributions that overlap with those of Anna's hummingbirds). The EBLN in the downy woodpecker was likely to have entered the host genome recently as in the phylogenetic tree it was embedded within the genetic diversity of exogenous viruses; the same pattern was observed in the case of the two viral copies in the genome of Anna's hummingbird ( Figure S4B in Additional file 6). Similar to previous studies in mammals [21], we found that more species have incorporated EBLN than EBLL. However, compared with their wide distribution in mammalian genomes, it was striking that only three avian species carried endogenous bornavirus-like elements.

Endogenous viral elements related to the Circoviridae
Circoviruses (family Circoviridae) possess approximately 2 kb ssDNA, nonenveloped and unsegmented circular genomes, and replicate in the nucleus via a rolling circle mechanism. They are known to infect birds and pigs and    can cause a wide range of severe symptoms such as Psittacine circovirus disease. There are two main open reading frames, usually arranged in an ambisense orientation, that encode the replication (Rep) and capsid (Cap) proteins. Endogenous circoviruses (eCiVs) are rare, and to date have only been reported in four mammalian genomes, with circoviral endogenization in carnivores dating to at least 42 million years [22].
We found circoviruses to be incorporated into only four avian genomes -medium ground finch, kea, egret, and tinamou -and at copy numbers of only 1 to 2 (Additional file 7; Table S5 in Additional file 1). There were at least two divergent groups of eCiVs in the viral phylogenetic tree, one in the medium ground-finch and great tinamou ( Figure S5A-C in Additional file 7), which was closely related to exogenous avian circoviruses, and another in the little egret and kea ( Figure  S5C,D in Additional file 7), which was only distantly related to avian exogenous counterparts. The large phylogenetic distances among these endogenous viruses are suggestive of independent episodes of viral incorporation. In addition, two pieces of evidence strongly suggested that eCiVs in the medium ground-finch and great tinamou ( Figure S5A-C in Additional file 7) have only recently entered host genomes: (i) they had close relationships with their exogenous counterparts, and (ii) they maintained complete (or nearly complete) open reading frames (Table S5 in Additional file 1).

Endogenous viral elements related to the Parvoviridae
The family Parvoviridae comprises two subfamilies -Parvovirinae and Densovirinae -that infect diverse vertebrates and invertebrates, respectively. Parvoviruses typically possess linear, non-segmented ssDNA genomes with an average size of approximately 5 kb, and replicate in the nucleus. Parvoviruses have been documented in a wide range of hosts, including humans, and can cause a range of diseases [23]. Recent studies revealed that endogenous parvoviruses (ePaVs) have been broadly distributed in mammalian genomes, with integration events dating back at least 40 million years [22].
We found multiple entries of ePaVs with very low copy numbers (1 to 3; Table S5 in Additional file 1) in 10 avian genomes (Additional file 8), and they were not as widely distributed as those parvoviruses present in mammalian genomes [22]. All avian ePaVs were phylogenetically close to exogenous avian parvoviruses with the exception of a single one from the brown mesite, which was distantly related to all known animal parvoviruses (Additional file 8). We also found several cases of apparently vertical transmission. For example, one common ePaV in the American crow and rifleman was flanked by the same unknown host gene; the viral copy in the golden-collared manakin and zebra finch was flanked by the tyrosine-protein phosphatase non-receptor type 13 (PTPN13) gene at the 5′ end and the same unknown gene at the 3′ end; and one viral element in the little egret and Dalmatian pelican was flanked by a same chicken repeat 1 (CR1) at the 5′ end and collagen alpha 1 gene (COL14A1) at the 3′ end (Data S2 in Additional file 4). These findings suggest both independent integration and vertical transmission (that is, common avian ancestry) for ePAVs that have seemingly existed in birds for at least 30 million years (that is, the separation time of Corvus and Acanthisitta [14]).

Low frequency of retroviral endogenous viral elements in bird genomes
To determine the overall pattern and frequency of infiltration of EVEs in the genomes of birds, American alligator, green turtle, anole lizard, and mammals, we documented the phylogeny-wide abundance of long terminal repeat (LTR)-retrotransposons of retrovirus-like origin [24]. As retroviral elements comprise >99.99% of avian EVEs they obviously represent the most meaningful data set to explore patterns of EVE evolution. This analysis revealed that retroviral EVEs are far less common in birds than in mammals: the average retroviral proportion of the genome was 1.12% (range 0.16% to 3.57%) in birds, 2.39% to 11.41% in mammals, and 0.80% to 4.26% in the genomes of American alligator, green turtle and anole lizard (Tables S6 and S7 in Additional file 1). Strikingly, there was also a simple linear relationship between host genome size and EVE proportion (R 2 = 0.787, P = 0.007; Figure 3). Of equal note was the observation that EVE copy numbers in bird genomes were an order of magnitude less frequent than in mammals (Figure 4; Tables S6 and S7 in Additional file 1), and that the relationship between viral copy number and host genome size exhibited a linear trend (R 2 = 0.780, P < 0.001). Importantly, in all cases (that is, genome size versus proportion and genome size versus copy number) (See figure on previous page.) Figure 2 Phylogenetic tree of exogenous and endogenous hepadnaviruses generated using complete polymerase (P) protein sequences. Bootstrap values lower than 70% are not shown; single asterisks indicate values higher than 70%, while double asterisks indicate values higher than 90%. Branch lengths are drawn to a scale of amino acid substitutions per site (subs/site). The tree is midpoint rooted for purposes of clarity only. The exogenous hepadnaviruses are marked. A cartoon of a virus particle marks the phylogenetic location of an inherited hepadnavirus invasion. Avian host species names are used to denote avian endogenous hepadnaviruses and scaffold numbers are given in Table S2 in Additional file 1. All abbreviations are given in Table S9 in Additional file 1. HBV, hepatitis B virus.
we employed phylogenetic regression analyses to account for the inherent phylogenetic non-independence of the data points.

Discussion and conclusions
Although a diverse array of viruses can possess endogenous forms [2], our analysis revealed that they are uncommon in avian genomes, especially those derived from RNA viruses. Indeed, among RNA viruses, we found only bornavirus endogenized forms occurred in avian genomes, and these had a sporadic distribution and very low frequencies. Although bird genomes are approximately one-third to one-half the size of those of mammals [25,26], the proportion of their genomes that comprises EVEs and their EVE copy numbers are 6 and 13 times less frequent, respectively. It is generally acknowledged that the genome size reduction associated with flying avian species evolved in the asurischian dinosaur lineage [25]. Our broad-scale genomic screening also suggested that a low frequency of EVEs was an ancestral trait in avian lineage, especially in the case of ERVs, such that there has been an expansion of EVE numbers in mammals concomitant with an increase in their genome sizes. Also of note was that although some genomic integration events in birds were vertical, allowing us to estimate an approximate time-scale for their invasion over many millions of years, by far the most common evolutionary pattern in the avian data was the independent integration of EVEs into different species/ genera.
There are a variety of reasons why EVE numbers could be so relatively low in avian genomes. First, it is theoretically possible that birds have been exposed to fewer viral infections than mammals. However, this seems unlikely as, although they are likely to have been examined less intensively than mammals [27], exogenous viruses of various kinds are found in avian species (for example, Coronaviridae, Flaviviridae, Hepadnaviridae, Orthomyxoviridae, Paramyxoviridae, Poxviridae, Retroviridae). In addition, the most common phylogenetic pattern we noted was that of independent integration, suggesting the presence of diverse exogenous infections. However, it is notable that mammals apparently harbor a more diverse set of exogenous retroviruses than birds, as well as a greater abundance of ERVs, which is indicative of a deepseated evolutionary interaction between host and virus [28]. For example, the only gammaretrovirus known in birds is reticuloendotheliosis virus (REV), and a recent study suggested that avian REVs have a mammalian origin [29]. This is consistent with our observation that there are no endogenized forms of REVs among this diverse set of avian genomes. It is also possible that birds are in some way refractory to EVE integration following viral infection. ERVs can replicate both as retrotransposons and as viruses via infection as well as re-infection. Although bird cells are known to be susceptible to certain retroviruses [1], the replication of avian ERVs within the host genome could be suppressed, at least in part, by host-encoded factors. However, a general conclusion of our study is that nonretroviral EVEs are seemingly rare in all vertebrates, such that their integration appears to be generically difficult, and the relative abundance of endogenous retroviruses in birds (albeit low compared with mammals) indicates that they are able to enter bird genomes, with some being actively transcribed and translated [12]. Our observation of a lineage-specific ERV expansion in three passerines also argues against a general refractory mechanism.
A third explanation is that birds are particularly efficient at purging EVEs especially for viruses with retroviral origin from their genomes, a process that we effectively 'caught in the act' in the case of the galliform hepadnaviruses. Indeed, our observation of a very low frequency of LTR-retrotransposons in avian genomes may reflect the action of a highly efficient removal mechanism, such as a form of homologous recombination. Hence, it is likely that active genome purging must be responsible for some of the relative absence of EVEs in birds, in turn retaining a selectively advantageous genomic compactness [30]. Clearly, additional work is needed to determine which of these, or other mechanisms, explain the low EVE numbers in avian genomes.

Genome sequencing and assembly
To systematically study endogenous viral elements in birds, we mined the genomes of 48 avian species (Table S1 in Additional file 1). Of these, three genomes -chicken [31], zebra finch [32] and turkey [33] -were downloaded from Ensembl [34]. The remaining genomes were acquired as part of our avian comparative genomics and phylogenomics consortium [13,14]. All genomes can be obtained from our two databases: CoGe [35] and Phylogenomics Analysis of Birds [36]. American alligator, green  Table S6 in Additional file 1, and the order among the American alligator, green turtle, anole lizard, and mammals given in Table S7 in Additional file 1. Asterisks indicate three oscine passerines showing an EVE expansion.
turtle, anole lizard, and 20 mammal genomes (Table S7 in Additional file 1) were downloaded from Ensembl [34] and used for genomic mining and the subsequent comparative analysis.

Genomic mining
Chromosome and whole genome shotgun assembles [13,[34][35][36] of all species (Table S1 in Additional file 1) were downloaded and screened in silico using tBLASTn and a library of representative viral protein sequences derived from Groups I to VII (dsDNA, ssDNA, dsRNA, +ssRNA, -ssRNA, ssRNA-RT, and dsDNA-RT) of the 2009 ICTV (International Committee on Taxonomy of Viruses) [37] species list (Additional file 9). All viral protein sequences were used for genomic mining. Host genome sequences that generated high-identity (E-values <1e -5 ) matches to viral peptides were extracted. Matches similar to host proteins were filtered and discarded. The sequences were considered virus-related if they were unambiguously matched viral proteins in the NCBI nr (non-redundant) database [38] and the PFAM database [39]. The putative viral gene structures were inferred using GeneWise [40]. The in silico mining of LTR-retrotransposons was performed using RepeatMasker [41].

Statistical analysis
To account for the phylogenetic relationships of avian taxa when investigating patterns of EVE evolution we employed phylogenetic linear regression as implemented in R [46]. Specifically, using Mesquite [47] we manually created a tree that matched the host vertebrate phylogeny [14,48]. For the subsequent phylogenetic regression analysis we utilized the 'phylolm' package in R [49], which provides a function for fitting phylogenetic linear regression and phylogenetic logistic regression. The extent of co-divergence between viruses and hosts was tested by using ParaFit [50], as implemented in the COPYCAT package [51]. The significance of the test was derived from 99,999 randomizations of the association matrix.