Microbial reference genomes for human metagenomics

Deep exome resequencing is a powerful approach for delineating patterns of protein-coding variation among genes, pathways, individuals and populations. We analyzed exome data from 2,440 individuals of European and African ancestry as part of the National Heart, Lung, and Blood Institute’s Exome Project, the aim of which is to discover novel genes and mechanisms that contribute to heart, lung and blood disorders. Each exome was sequenced to a mean coverage of 116×, allowing detailed inferences about the population genomic patterns of both common variation and rare coding variation. We identified more than 500,000 single nucleotide variations, the majority of which were novel and rare (76% of variants had a minor allele frequency of less than 0.1%), reflecting the recent dramatic increase in the size of the human population. The unprecedented magnitude of this dataset allowed us to rigorously characterize the large variation in nucleotide diversity among genes (ranging from 0 to 1.32%), as well as the role of positive and purifying selection in shaping patterns of protein-coding variation and the differential signatures of population structure from rare and common variation. This dataset provides a framework for personal genomics and is an important resource that will allow inferences of broad importance to human evolution and health.
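To give a sense of how rare a variant with a minor allele frequency below 0.1% is in a sample of this size, the arithmetic can be sketched as follows. This is an illustration only: it assumes diploid autosomal sites and that all 2,440 individuals were successfully genotyped at the site, and the function names are hypothetical rather than part of the project's pipeline.

```python
def minor_allele_frequency(minor_allele_count: int, n_individuals: int) -> float:
    """Frequency of the minor allele among 2 * n_individuals chromosomes (diploid assumption)."""
    return minor_allele_count / (2 * n_individuals)


def max_copies_below(maf_threshold: float, n_individuals: int) -> int:
    """Largest allele count whose frequency is strictly below maf_threshold."""
    n_chromosomes = 2 * n_individuals
    count = int(maf_threshold * n_chromosomes)
    # Step down if rounding left us at or above the threshold.
    while count > 0 and count / n_chromosomes >= maf_threshold:
        count -= 1
    return count


if __name__ == "__main__":
    n = 2440  # individuals in the abstract, i.e. 4,880 chromosomes
    copies = max_copies_below(0.001, n)
    print(f"MAF < 0.1% in {n} individuals means at most {copies} copies of the allele")
    print(f"A singleton has MAF = {minor_allele_frequency(1, n):.5f}")
```

Under these assumptions, a variant in this frequency class is carried on only a handful of the sampled chromosomes.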

when MI occurs early in life. Genome-wide association studies (GWAS) have identified at least 30 common variants associated with MI, but the modest proportion of overall heritability that these account for suggests that variants that are low in frequency (0.5 to 5% frequency) or rare (<0.5% frequency) may contribute to the risk of early onset MI (EOMI). To test the hypothesis that rare coding mutations contribute to EOMI risk, we are sequencing the protein-coding region (the exome) of about 1,100 cases with EOMI (MI in men ≤50 and women ≤60) and about 1,100 controls free of MI. Using next-generation sequencing, we have targeted 32.7 Mb at 188,260 exons from 18,560 genes. In the first 970 exomes sequenced, we have generated about 6 billion bases of sequence per individual. Each targeted base was read, on average, 146 times; for each individual, approximately 87% of all bases were covered with at least 20× depth. We performed burden-of-rare-variant tests, carried out single SNP (single nucleotide polymorphism) association tests and imputed exomic variants into completed MI GWAS datasets. In the burden-of-variant tests, we found an excess of rare mutations (all nonsynonymous with a minor allele frequency <1% (T1) or 5% (T5)) in several genes, including CHRM5 (P = 0.0001 for T1), DKK2 (P = 0.0003 for T5) and LRIG2 (P = 0.0002 for T5). In single SNP association tests, we rediscovered a known nonsense mutation in PCSK9 that confers protection against MI (0 in cases and 6 in controls, in 466 cases and 504 controls). In imputation using EOMI exomes as the reference panel, we rediscovered the association of a known low-frequency missense SNP in LPA (I4399M, 2% allele frequency, P < 5 × 10^−8).
We are replicating findings from the discovery study by using three approaches: Sanger sequencing in independent samples (500 cases and 500 controls) of specific genes with the signal based on a burden of rare mutations; genotyping of 212 low-frequency SNPs in >10,000 independent MI cases and controls; and imputation of exomic variants into >35,000 MI cases and controls with GWAS data. These replication results should provide insight into the role of rare variants in conferring MI risk and the role of exome sequencing in understanding the inherited basis of complex traits.
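The PCSK9 observation reported in this abstract (0 carriers among 466 cases versus 6 among 504 controls) can be examined with a one-sided Fisher's exact test. The sketch below is a minimal illustration using only the Python standard library; it assumes that the counts refer to carrier individuals, and it is not necessarily the single-SNP test that the authors applied.

```python
from math import comb


def fisher_lower_tail(carriers_in_cases: int, n_cases: int,
                      n_controls: int, n_carriers: int) -> float:
    """One-sided Fisher's exact P value: probability of seeing this many or fewer
    carriers among the cases, given the margins (hypergeometric lower tail)."""
    total = n_cases + n_controls
    denominator = comb(total, n_carriers)
    numerator = sum(
        comb(n_cases, i) * comb(n_controls, n_carriers - i)
        for i in range(carriers_in_cases + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Counts taken from the abstract; interpreting them as carriers is an assumption.
    p = fisher_lower_tail(carriers_in_cases=0, n_cases=466, n_controls=504, n_carriers=6)
    print(f"one-sided P = {p:.4f}")
```

With only six carriers in total, the evidence from a single variant is necessarily limited, which is one motivation for collapsing rare variants within a gene in burden tests.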

Next-generation clinical sequencing in a children's hospital
Stephen Kingsmore
Children's Mercy Hospital, Kansas City, MO, USA
Genome Biology 2011, 12(Suppl 1):I21
Next-generation sequencing and analysis tools are reaching the mature stage at which mentioning their usefulness for clinical testing is not an oxymoron. Indeed, next-generation clinical sequencing has the potential to transform children's health care, because inherited illnesses account for much of the childhood disease burden. I will discuss the first year of the integration of genomic medicine for Mendelian diseases at Children's Mercy Hospital.

Next-generation DNA sequencing has dramatically affected cancer genomics efforts in several important ways. Although whole genome sequencing remains an analytical challenge, such efforts are yielding data that elucidate the myriad ways in which a genome can be influenced by single point mutations, focused insertions or deletions, and large structural alterations. In addition to cataloguing somatic alterations, various correlation analyses are indicating the genes whose alterations most profoundly determine patient outcomes, patient responses to therapeutics and other important aspects of disease biology. We have recently begun exploring how the digital nature of next-generation sequencing reads allows important information about tumor cell genomic heterogeneity to be inferred, revealing the earliest mutations and how the composition of the tumor cell mass changes over time under the influence of stressors such as chemotherapy.
Over the past five years, a new generation of technologies has reduced the cost of DNA sequencing by more than four orders of magnitude, democratizing the field by putting the sequencing capacity of a major genome center in the hands of individual investigators [1]. To exploit this paradigm shift, we have developed new technical methods and analytical strategies for disease gene discovery based on whole exome and whole genome sequencing. Our results to date include proof of concept [2] and the first demonstration [3] that exome sequencing of a small number of individuals can be applied to solve Mendelian, single-gene, disorders such as Miller syndrome [3] and Kabuki syndrome [4]. Recently, we have also demonstrated that exome or genome sequencing of parent-child trios can be used to rapidly identify candidate genes for complex disorders such as autism [5]. We are currently extending these strategies to additional simple and complex diseases of unknown etiology.

microbiome, these microbial cells outnumber human cells in the body by more than 10 to 1, and the genes carried by these organisms outnumber the genes in the human genome by more than 100 to 1. How these organisms contribute to and affect human health is poorly understood, but the emerging field of metagenomics promises a more comprehensive and complete understanding of the human microbiome.
In the European-funded Metagenomics of the Human Intestinal Tract (MetaHIT) project [1], we combined next-generation sequencing with high-density microarrays, generating metagenomic and metatranscriptomic data for more than 400 individuals. The combined data reveal clusters of coexisting species with differences in pathway and gene function activity, suggesting that there is a division of labor between the bacterial species in the human gut microbiome.

Diseases of the vaginal tract result from perturbations of the complex interactions among microbes of the host vaginal ecosystem. Recent advances in our understanding of these complex interactions have been enabled by next-generation-sequencing-based approaches, which make it possible to study the vaginal microbiome. In harnessing these approaches, we are beginning to define what constitutes an imbalance of the vaginal microbiome and how such imbalances, along with associated host factors, lead to infection and disease states such as bacterial vaginosis (BV), preterm births, and susceptibility to HIV and other sexually acquired infections. We have exploited various approaches to this end: comparative analysis of reference microbial genomes of vaginal isolates; comparative microbiome, metabolome and metagenome analysis of vaginal communities from subjects deemed to be healthy and individuals with BV; and comparative microbiome analysis of vaginal communities from humans and non-human primate species. The results from comparative genome sequencing have led us to suggest that different strains of the proposed pathogen Gardnerella vaginalis have different virulence potentials and that the detection of G. vaginalis in the vaginal tract is not indicative of a disease state [1].
Comparative microbiome, metabolome and metagenome analysis of vaginal communities from humans has demonstrated that the microbial communities from subjects with BV have a defined bacterial composition and metabolic profile that is distinct from that of subjects who do not have BV [2 and unpublished observations].
Our studies of microbial communities from non-human primate species and humans provide a unique comparative context. From an evolutionary perspective, humans and non-human primates differ considerably in mating habits, estrus cycles and gestation period. Moreover, birth is difficult in humans relative to other primates, increasing the risks of maternal injury and infection. In light of these numerous differences between humans and non-human primates, we hypothesize that humans have microbial populations that are distinct from those of non-human primates. Preliminary results show that the vaginal microbiomes of non-human primates are more diverse and are compositionally distinct from human vaginal microbiomes [3,4]. The composition of bacterial genera found in non-human primates is dissimilar to that seen in humans, most notably with lactobacilli being much less abundant in non-human primates. Our observations point to vaginal microbial communities being an important component of an evolutionary set of adaptations that separates humans from other primates and is of fundamental importance to health and reproductive function.

For more than a decade, the Joint Center for Structural Genomics (JCSG) [1] has been at the forefront of developing tools and methodologies that allow the application of high-throughput structural biology to a broad range of biological and biomedical investigations. In the previous phases of the National Institutes of Health's Protein Structure Initiative (PSI; 2000 to 2010) [2], we explored structural coverage of uncharted regions of the protein universe [3], as well as a single organism, allowing complete structural reconstruction of the metabolic network of Thermotoga maritima [4]. In the current phase (PSI:Biology; 2010 to 2015), the JCSG is leveraging its high-throughput platform to explore the structural basis for host-microbe interactions in the human microbiome.
The emerging field of metagenomics has been particularly enlightening: the human gut microbiome sequencing projects have already uncovered fascinating new families and expansions of known families for adaptation to this environment. The gut microbiota is dominated by poorly characterized bacterial phyla, which contain an unusually high number of uncharacterized proteins that are largely unstudied. Their influence upon human development, physiology, immunity and nutrition is only starting to surface and is thus an exciting new frontier for structural genomics, where we can structurally investigate the contributions of these microorganisms to human health and disease. The JCSG is located at The Scripps Research Institute, the Genomics Institute of the Novartis

Next-generation sequencing of RNA (RNA-Seq) is a powerful tool that can be applied to a wide range of biological questions. RNA-Seq provides insight at multiple levels into the transcription of the genome. It yields sequence, splicing and expression-level information, allowing the identification of novel transcripts and sequence alterations. We have been developing and comparing methods for samples that present a challenge: that is, those with low quantity and/or quality RNA. RNA-Seq methods that start from total RNA and do not require the oligo(dT) purification of mRNA will be valuable for such challenging samples. Such methods use alternative approaches to reduce the fraction of sequencing reads derived from rRNA. We will present results from multiple approaches, including the use of not-so-random (NSR) primers for cDNA synthesis, low-C0t hybridization with a duplex-specific nuclease for light normalization and NuGEN's Ovation RNA-Seq kit. We demonstrated that these three methods successfully reduce the fraction of rRNA to less than 13%, even when starting from degraded RNA.
We compared the performance between these methods and with 'gold standard' RNA-Seq data (derived from samples with large quantities of high-quality RNA), using quantitative criteria that evaluate effectiveness for genome annotation, transcript discovery and expression profiling. The application of these methods to samples that contain degraded RNA and/or very low input amounts of RNA will also be presented.

Viral diversity in children with diarrhea in Gambia
Irina Astrovskaya 1 , Bo Liu 1 and Mihai Pop 1

Results
We were able to detect and assemble sequences from known diarrhea-causing viruses (such as rotaviruses, adenoviruses and noroviruses), known human viruses (such as herpesviruses and enteroviruses) and potential diarrhea-causing viruses (such as bocaviruses, astroviruses and parechoviruses). These findings were consistent with independent virology results.
In some clinical cases, sequences from classic viruses were found, but the virology results were negative.

COSMIC provides a large number of graphical and tabular views for interpreting and mining the large quantity of information, as well as the facility to export the relevant data in various formats. The website can be navigated in many ways to examine mutation patterns on the basis of genes, samples and phenotypes, which are the main entry points to COSMIC. COSMIC also provides various options to browse the data in a genomic context. Integration with the Ensembl genome browser allows the visualization of full genome annotations, together with COSMIC data, on the GRCh37 genome coordinates. COSMIC also contains its own genome browser, which facilitates data analysis by combining genome-wide gene structures and sequences with rearrangement breakpoints, copy number variations and all somatic substitutions, deletions, insertions and complex gene mutations. The main COSMIC website [1] encompasses all of the available data. However, within COSMIC, the Cancer Cell Line Project [3] is a specialized component, which provides details of the genotyping of almost 800 commonly used cancer cell lines, through the set of known cancer genes. Its focus is to identify driver mutations, or those likely to be implicated in the oncogenesis of each tumor. This information forms the basis for integrating COSMIC with the Genomics of Drug Sensitivity in Cancer project [4], which is a joint effort with the Massachusetts General Hospital [5] to screen this panel of cancer cell lines against potential anticancer therapeutic compounds to investigate correlations between somatic mutations and drug sensitivity.
Data on somatic mutations in cancer are being produced at a rapidly increasing rate, and the combined analysis of large distributed datasets is becoming ever more difficult. However, COSMIC curates and standardizes this information in a single database, providing user-friendly browsing tools and analytical functions, thus ensuring its role as a key resource in human cancer genetics.

Background Recent genome-wide association studies (GWAS) have identified allele T of a single nucleotide polymorphism (SNP), rs2294008, in the prostate stem cell antigen (PSCA) gene as a risk factor for bladder cancer [1,2]. In the present study, we aimed to find additional disease-associated SNPs in the PSCA region and to explore their possible molecular function.
Methods Based on information from the 1000 Genomes and HapMap 3 projects, we performed imputation analysis on 3,532 bladder cancer cases and 5,120 healthy controls of European ancestry from the stage 1 bladder cancer GWAS, within ±100 kb of the region flanking the GWAS signal, rs2294008. The average allele dosage and best-guess genotypes were estimated and tested for association between SNP variants and bladder cancer risk by using unconditional logistic regression. Functional follow-up studies included RNA sequencing in normal and tumor bladder samples and electrophoretic mobility shift assays to examine the potentially altered DNA-protein interactions for SNPs of interest.
Results A total of 639 imputed and 37 genotyped SNPs within ±100 kb of the region of the original GWAS signal were tested for genetic association with bladder cancer. In these stage 1 GWAS samples, the SNP rs2294008 had a per-allele odds ratio (OR) of 1.09 (95% confidence interval (CI) = 1.02 to 1.16, P = 6.93 × 10^−4). Multivariable logistic regression analysis adjusted for the study center, age, gender, smoking status and rs2294008 genotype revealed a novel associated variant, rs2978974 (OR = 1.11, 95% CI = 1.04 to 1.19, P = 1.62 × 10^−3).
There was low linkage disequilibrium between rs2978974 and the original GWAS signal, rs2294008 (D = 0.19, r^2 = 0.02). Only individuals carrying the risk variant of both SNPs had an increased risk of bladder cancer (OR = 1.24, 95% CI = 1.13 to 1.35, P = 4.69 × 10^−6) and not individuals who carried a risk variant of only one of the SNPs (P > 0.05). Stratified analysis suggested that this compound effect of rs2294008 and rs2978974 was more significant in males (OR = 1.27, P = 2.80 × 10^−6) than in females (OR = 1.08, P = 0.52). rs2978974 resides 10 kb upstream of rs2294008, is marked by an H3K4me3 signal and is in the vicinity of an androgen-receptor-binding site. Using RNA sequencing of bladder samples, we showed that rs2978974 is located within an alternative, untranslated first exon of PSCA. Using the electrophoretic mobility shift assay with nuclear proteins from LNCaP and HeLa cells, we observed that the non-risk-associated allele (G) of rs2978974, but not the risk allele (A), could bind to ELK1, a protein belonging to the ETS family of transcription factors.
Conclusions We identified a SNP, rs2978974, in the PSCA region as a novel marker for bladder cancer susceptibility. There was a compound effect in carriers of both the rs2294008 and rs2978974 risk variants. The functional relevance of rs2978974 might be related to the loss of ELK1 regulation by the risk allele (A) and differential regulation of PSCA mRNA expression.

Background Dinoflagellates are a diverse group of ecologically important eukaryotic algae, the global impact of which ranges from the large-scale primary production of oxygen [1] to devastating toxic algal blooms [2]. These organisms have exceptionally large genomes (10^9 to 10^11 bases) [3] and highly duplicated genes (which can occur thousands of times within a single genome) [4]. These and other unusual characteristics have made dinoflagellates difficult to study using traditional molecular biology techniques.
Sequence data for dinoflagellates are correspondingly sparse, and not a single genome sequence has been published to date. As part of our project called Assembling the Dinoflagellate Tree of Life (DAToL), our laboratory has sequenced the transcriptome of Polarella glacialis. Its genome is estimated to be only 3 Gb in size, making it one of the smallest known dinoflagellate genomes. Because we had to rely on de novo assemblers that had been tested using data from organisms that are extremely divergent from dinoflagellates, we took special care in our attempts to validate the data. Before expanding our analyses to include additional dinoflagellates, we compared the results from different sequencing and assembly methods.
Methods Total RNA was extracted from cultured P. glacialis. This sample was then divided and shipped to Macrogen for rRNA degradation, library preparation and sequencing. One library was sequenced on one-eighth of a Roche/454 GS FLX picotiter plate using Titanium chemistry. A second library was sequenced using one lane on an Illumina GAIIx sequencer for 78 cycles in both directions (paired end). The sequences were assembled using Newbler, MIRA, Oases and Trinity, and they were analyzed using various custom scripts.

Results
The total amount of unassembled 454 sequence data added to less than one-third of the combined lengths of only those Trinity transcripts that had a significant BLAST hit against a sequence in GenBank, indicating that we did not achieve complete coverage with our 454 data.
Conclusions Our primary hypothesis was that the longer read lengths of the 454 data might allow the corresponding assemblers to better resolve repetitive sequences, which could be instrumental for assembling conserved regions within highly duplicated genes. Our failure to obtain complete coverage with the 454 dataset undermined our ability to test this hypothesis, although we made several other interesting observations. Notably, despite the vast disparity in the depth of the coverage between the 454 and Illumina assemblies, we observed unique, apparently real sequences within some of the 454 contigs.

few days. However, the sequencing results always turn out to contain several hundred contigs. A multiplex PCR procedure is then needed to fill all of the gaps and to link the contigs into one full-length genome sequence [1][2][3][4][5][6][7][8][9][10]. The full-length prokaryotic genome sequence is the gold standard for comparative prokaryotic genome analysis. This study assessed pyrosequencing strategies by using a simulation with 100 prokaryotic genomes.

Results
Our simulation shows the following: first, a single-end 454 Jr Titanium run combined with a paired-end 454 Jr Titanium run may assemble about 90% of 100 genomes into <10 scaffolds and 95% of 100 genomes into <150 contigs; second, the average contig N50 size is more than 331 kb (Table 1); third, the average single base accuracy is >99.99% (Table 1); fourth, the average false gene duplication rate is <0.7% (Table 1); fifth, the average false gene loss rate is <0.4% (Table 1); sixth, the total size of long repeats (both repeat length >300 bp and >700 bp) is significantly correlated to the number of contigs (Table 4); and, seventh, increasing the read length of a pyrosequencing run could improve the assembly quality significantly (Tables 1-3).
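The contig N50 statistic quoted in these results has a standard definition: the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly length. A minimal sketch of that calculation follows; the contig lengths in the example are invented for illustration and do not come from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig at which the cumulative length first reaches half the total.
    """
    total = sum(contig_lengths)
    if total == 0:
        raise ValueError("N50 is undefined for an empty assembly")
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in kb; total = 290, half = 145.
    lengths_kb = [80, 70, 50, 40, 30, 20]
    print("N50 =", n50(lengths_kb), "kb")
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract reports base accuracy and false gene duplication and loss rates alongside it.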
Conclusions A single-end 454 Jr run combined with a paired-end 454 Jr run is a good strategy for prokaryotic genome sequencing. This strategy provides a solution to producing a high-quality draft genome sequence of almost any prokaryotic organism, selected at random, within days. It could be the first step to achieving the full-length genome sequence. It also makes the subsequent multiplex PCR procedure (for gap filling) much easier, aided by the knowledge of the orders/orientations of most of the contigs. As a result, large-scale full-length prokaryotic genome-sequencing projects could be finished within weeks.

Background A recent genome-wide association study (GWAS) identified a single nucleotide polymorphism, rs8102137, located 6 kb upstream of the cyclin E1 gene (CCNE1) on chromosome 19q12, as a risk factor for bladder cancer (odds ratio (OR) = 1.13, P = 1.7 × 10^−11) [1]. CCNE1 encodes a cell cycle protein that regulates cyclin-dependent kinases and is therefore an important cancer susceptibility gene.
Methods This study used 42 bladder tumor samples and 41 normal bladder tissue samples (24 matched normal-tumor pairs), HeLa cells and several prostate and bladder cancer cell lines. Genotyping of rs8102137 in DNA and rs7257694 in both DNA and cDNA samples was performed using an allelic discrimination genotyping assay. TaqMan and SYBR Green assays were used to measure the expression of the different CCNE1 isoforms. The CCNE1 isoforms were cloned into a pFC14A (HaloTag) CMV Flexi Vector. Protein expression of CCNE1 isoforms in normal and tumor bladder tissues and transfected cells was analyzed by western blotting. Subcellular localization of recombinant CCNE1 splicing forms was analyzed by confocal microscopy.
Results CCNE1 mRNA was expressed at a higher level in bladder tumors (n = 42) than in adjacent normal bladder tissue samples (n = 41, 3.7-fold, P = 2.7 × 10^−12). However, no association was found between mRNA expression level and the genotype of rs8102137. We observed strong allelic expression imbalance for a synonymous coding variation located in the last exon (rs7257694, Ser390Ser), which is in high linkage disequilibrium with rs8102137 (normal bladder tissue samples, n = 41, D' = 1.0, r^2 = 0.815; HapMap CEU samples, n = 60, D' = 0.95, r^2 = 0.68). In normal and tumor tissue samples heterozygous for both single nucleotide polymorphisms, the risk variant of rs8102137 was associated with lower expression of allele T of rs7257694 (normal samples, P = 2.2 × 10^−4; tumor samples, P = 1.11 × 10^−10). Western blotting analysis of bladder tissue and prostate cell line lysates revealed that the allelic expression imbalance is likely to be related to two CCNE1 protein isoforms that showed a differential pattern of expression dependent on the rs8102137 and rs7257694 genotype. We have cloned the alternative splicing forms of CCNE1 and are currently evaluating their functional relevance.
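The two linkage disequilibrium measures quoted in these results, D' and r^2, can diverge: D' can equal 1.0 while r^2 stays below 1 whenever the two SNPs have different allele frequencies. The sketch below computes both from haplotype and allele frequencies using the standard textbook definitions. The frequencies in the example are hypothetical and are not the study's data, and the function name is illustrative.

```python
def ld_statistics(p_ab: float, p_a: float, p_b: float):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical frequencies: every A-bearing haplotype also carries B,
    # but B is more common than A, so D' = 1 while r^2 < 1.
    d, d_prime, r2 = ld_statistics(p_ab=0.2, p_a=0.2, p_b=0.3)
    print(f"D = {d:.3f}, D' = {d_prime:.2f}, r^2 = {r2:.3f}")
```

In practice haplotype frequencies are not observed directly from unphased genotypes and have to be estimated, for example by an expectation-maximization procedure, which this sketch omits.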
Conclusions Our results suggest that bladder-cancer-associated genetic variants of the CCNE1 gene might contribute to altered cell cycle regulation, owing to differential mRNA splicing producing different protein isoforms of CCNE1.

Background Metagenomics has allowed the study of a wide range of microbial communities, from those within the sea [1,2] to those of the human body [3]. Increasingly, de novo assembly is the first step in the analysis of these metagenomic samples. As the targets have increased in complexity, computational tools have started to emerge [4,5] to address the challenges presented by the assembly of these datasets. Although the targets and analyses have become more complex, the means of presenting the results has remained the same: a multi-FASTA text file. This presentation hides the variation that is present in the sampled biological community. The ability to navigate and view the complexity of a genomic sample may help drive novel biological insights. Here, we present a graphical visualization tool that allows the visual inspection of genome assembly graphs and the characterization of the genomic variation that is present in these graphs (that is, the differences between two or more related haplotypes commonly found in metagenomes or higher eukaryotes).
Methods Our software, ScaffViz [6], is open source and was developed as a plug-in for the Cytoscape graph viewer package [7,8]. Our assembly view represents assembly metadata within node/edge attributes. For example, node height corresponds to coverage (the amount of oversampling of a sequence), and node width is proportional to the length of the sequence. We support assemblies from Celera Assembler [9], Newbler [10], Bambus 2 and MetAMOS. The creation and initialization of Cytoscape objects is abstracted to allow a developer to easily add new assembly result formats without knowledge of Cytoscape's API.
We developed a layout algorithm based on information from the assembler on node position, orientation and length. ScaffViz allows users to show (or hide) an arbitrary subset of nodes. The viewer can also output genome sequence that corresponds to any subset of the graph, including all alternative sequences present in all selected subpaths. We believe that this representation may prove to be instrumental in finding and characterizing structural variants such as alternative genes, alternative regulatory units or mobile genomic elements.
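The idea of emitting every alternative sequence along the selected subpaths of an assembly graph can be illustrated with a toy example. The sketch below is not the plug-in's implementation or API (which is built on Cytoscape); the `Contig` class, the `alternative_sequences` function and the example graph are all hypothetical, and the sketch ignores contig orientation, overlaps and gap sizes, which a real scaffold graph would need to handle.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """A graph node; coverage and length are the attributes drawn as node height and width."""
    name: str
    sequence: str
    coverage: float

    @property
    def length(self) -> int:
        return len(self.sequence)


def alternative_sequences(contigs, edges, start, end):
    """Yield the concatenated sequence of every simple path from start to end.

    contigs: dict mapping node name -> Contig
    edges:   dict mapping node name -> list of successor node names
    """
    def walk(node, path):
        if node == end:
            yield "".join(contigs[name].sequence for name in path)
            return
        for successor in edges.get(node, []):
            if successor not in path:  # keep paths simple (no repeated nodes)
                yield from walk(successor, path + [successor])

    yield from walk(start, [start])


if __name__ == "__main__":
    # A "bubble": two alternative middle contigs between shared flanks.
    contigs = {
        "A": Contig("A", "ACGT", 30.0),
        "B1": Contig("B1", "AA", 18.0),
        "B2": Contig("B2", "GG", 12.0),
        "C": Contig("C", "TTT", 30.0),
    }
    edges = {"A": ["B1", "B2"], "B1": ["C"], "B2": ["C"]}
    for sequence in alternative_sequences(contigs, edges, "A", "C"):
        print(sequence)
```

In this toy bubble the coverage of the two middle contigs sums to that of the flanks, the kind of pattern that might suggest two related haplotypes in a sample.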

Results
We evaluated the performance of ScaffViz on seven datasets of varying size and complexity. We report that the run time is approximately linear with respect to the number of elements in the graph (nodes + edges). The memory scales linearly with respect to the number of nodes. Extrapolating from these factors, a graph of 250,000 contigs can be opened in approximately 2 minutes using approximately 2.5 GB of memory. ScaffViz is scalable to large graphs and can be run on a laptop.
Conclusions We have developed a novel open-source assembly graph viewer, ScaffViz, as a plug-in for Cytoscape. ScaffViz supports the output of several popular assembly programs and is scalable to large metagenomic assemblies on a laptop.
Most of the DNA viruses in the gastrointestinal tract are phages, which infect bacterial hosts. Despite phages being the most abundant organisms on Earth, as well as extremely active players in the global ecosystem, much remains unknown about how they function in their natural environments. Advances in whole genome sequencing technologies have generated a large collection of hundreds of phage genomes, allowing deep insight into the genetic evolution of phages, and metagenomics technologies seem to promise more rewarding glimpses into their life cycles and community structures.
Recently, we developed an automated approach to assemble a collection of orthologous gene clusters of double-stranded DNA phages (phage orthologous groups, or POGs). This approach follows the well-known clusters of orthologous groups (COGs) framework to identify sets of orthologs by examining top-ranked sequence similarities between proteins in complete genomes without the use of arbitrary similarity cutoffs, and it thus represents a natural system for examining fast-evolving and slow-evolving proteins alike. This automated approach was designed to keep pace with the rapid and accelerating growth of whole genome information from sequencing projects. In particular, we employ a faster graph-theoretical COG-building algorithm that vastly improves our ability to deal with larger numbers of genomes (N) by reducing the worst-case complexity from O(N^6) to O(N^3 × log N). This system encompasses more than 2,000 groups from the almost 600 known phage genomes deposited at the National Center for Biotechnology Information and is in the process of being expanded to include single-stranded DNA phages and single- and double-stranded RNA phages.
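The core of the COG framework, using top-ranked hits rather than a similarity cutoff, can be illustrated with bidirectional best hits (BBHs) and the triangles they form across three genomes. The sketch below is a simplified, brute-force illustration of that classic idea, not the faster graph-theoretical algorithm described in this abstract. The input format and all protein and genome names are hypothetical.

```python
from itertools import combinations


def bidirectional_best_hits(best_hit):
    """Return the set of reciprocal best-hit pairs.

    best_hit maps (query_protein, target_genome) -> best-scoring protein in that genome,
    where each protein is a (genome, name) tuple. No score cutoff is applied:
    only the rank of the hit matters.
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        if best_hit.get((hit, query[0])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def bbh_triangles(pairs):
    """Return sorted triples of proteins that are pairwise bidirectional best hits."""
    proteins = sorted({protein for pair in pairs for protein in pair})
    return [
        triple
        for triple in combinations(proteins, 3)
        if all(frozenset(edge) in pairs for edge in combinations(triple, 2))
    ]


if __name__ == "__main__":
    a, b, c = ("G1", "a"), ("G2", "b"), ("G3", "c")
    x = ("G1", "x")  # a paralog-like protein whose best hit is not reciprocated
    best_hit = {
        (a, "G2"): b, (b, "G1"): a,
        (a, "G3"): c, (c, "G1"): a,
        (b, "G3"): c, (c, "G2"): b,
        (x, "G2"): b,
    }
    pairs = bidirectional_best_hits(best_hit)
    print(len(pairs), "BBH pairs;", "triangles:", bbh_triangles(pairs))
```

Enumerating all triples, as done here, scales poorly with the number of proteins, which is the kind of cost that a more efficient COG-building algorithm is meant to avoid.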
Using this approach, we found that more than half of the POGs have no or very few evolutionary connections to their cellular hosts, indicating that these phages combine the ability to share and transduce the host genes with the ability to maintain a large fraction of unique, phage-specific, genes. Such genes are useful for targeted research strategies: for example, as diagnostic indicators and fundamental units of systems biology studies. We employed this set of phage-specific genes to probe the composition of several oceanic metagenomic samples. Although virus-enriched samples indeed contain more homologous matches to phage-specific POGs than a full metagenomic sample also containing cellular DNA, the total gene repertoire of the marine DNA virome is dramatically different from that of known phages. In particular, it is dominated by rare genes, many of which might be contained within virus-like entities such as cellular gene transfer agents rather than true viruses. This result might suggest the necessity of radically rethinking what constitutes the 'virus world', because the major component of (marine) viromes could be gene transfer agents that encapsidate bacterial and archaeal genes.

Background Recent genome-wide association studies have led to the reliable identification of single nucleotide polymorphisms (SNPs) at a number of loci associated with an increased risk of developing specific common human diseases. Each such locus implicates multiple possible candidate SNPs as being involved in the disease mechanism, and determining which SNPs actually contribute, and by what mechanism, is a major challenge. A variety of mechanisms may link the presence of a SNP to altered in vivo gene product function and hence contribute to disease risk.
We have analyzed the role of one of these mechanisms, nonsynonymous SNPs (nsSNPs) in proteins, for associations found in the Wellcome Trust Case-Control Consortium (WTCCC) study of seven common diseases [1] and the follow-up work. Methods Using HapMap data and linkage disequilibrium information, we identified all possible candidate SNPs associated with increased disease risk. We then applied two computational methods [2,3], based on analysis of protein structure and sequence, to determine which of these SNPs has a significant impact on in vivo protein function (SNPs3D) [4].
Results Several of these disease-associated loci were found to be linked to one or more high-impact nsSNPs. In some cases, these SNPs are in well-known proteins (such as human leukocyte antigens). In other cases, they are in less well-established disease-associated genes (for example, MST1 for Crohn's disease), and in yet others, they are in proteins that have been poorly investigated (for example, gasdermin B, also for Crohn's disease). Approximately 55% of these disease-associated loci have at least one nsSNP, and about 33% of them have at least one high-impact nsSNP in those regions.
Genome Biology 2011, 12(Suppl 1) http://genomebiology.com/supplements/12/S1
Conclusions Together, these data suggest a significant role for nsSNPs in
Background A major goal of metagenomics is to characterize the taxonomic composition of an environment. The most popular approach relies on 16S rRNA sequencing; however, this approach can generate biased estimates owing to differences in the copy number of the gene, even between closely related organisms, and owing to PCR artifacts. In addition, the taxonomic composition can also be determined from metagenomic shotgun sequences by matching reads against a database of reference sequences. One major limitation of the computational methods that have been used for this purpose is the use of a universal classification threshold for all genes at all taxonomic ranks. Methods We present a novel taxonomic profiler for metagenomic sequences, MetaPhyler [1], which relies on 31 phylogenetic marker genes as a taxonomic reference. Because genes can evolve at different rates and because shotgun reads contain gene fragments of different lengths, we propose that better classification results can be obtained by tuning the taxonomic classifier to the length of the gene fragment, to a particular gene and to the taxonomic rank. Our classifier uses different thresholds for each of these parameters, and these thresholds are automatically learned from the taxonomic structure of the reference database.
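The idea of a threshold that depends on the gene, the fragment length and the taxonomic rank can be sketched as a simple lookup. This is only an illustration of the principle: the abstract does not describe MetaPhyler's scoring or how the thresholds are learned, so the bit-score cutoffs, the 60-bp length bins, the gene name and the lineage below are all hypothetical values chosen for the example.

```python
# Ranks ordered from most to least specific (an assumption for this sketch).
RANKS = ["genus", "family", "order", "class", "phylum"]


def length_bin(fragment_length, bin_size=60):
    """Map a fragment length in bp to a coarse bin index (hypothetical binning)."""
    return fragment_length // bin_size


def classify(gene, fragment_length, bit_score, hit_lineage, thresholds):
    """Return (rank, taxon) for the most specific rank whose cutoff is met.

    thresholds: dict mapping (gene, rank, length bin) -> minimum bit score.
    hit_lineage: dict mapping rank -> taxon name of the best reference hit.
    Returns None when no rank-specific cutoff is satisfied.
    """
    b = length_bin(fragment_length)
    for rank in RANKS:
        cutoff = thresholds.get((gene, rank, b))
        if cutoff is not None and bit_score >= cutoff:
            return rank, hit_lineage[rank]
    return None


# Hypothetical thresholds for one marker gene and one length bin.
thresholds = {
    ("rpoB", "genus", 1): 80.0,
    ("rpoB", "family", 1): 60.0,
    ("rpoB", "phylum", 1): 30.0,
}
lineage = {
    "genus": "Escherichia",
    "family": "Enterobacteriaceae",
    "order": "Enterobacterales",
    "class": "Gammaproteobacteria",
    "phylum": "Proteobacteria",
}
print(classify("rpoB", 60, 65.0, lineage, thresholds))
```

With a universal threshold, the same read would either be rejected outright or assigned at every rank; the per-rank lookup lets a weaker hit still be reported at a coarser rank.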

Results
We have randomly simulated about 300,000 DNA sequences of 60 bp and about 70,000 DNA sequences of 300 bp from phylogenetic marker genes. Table 1 shows the performance of the phylogenetic classifications from MetaPhyler, PhymmBL [2], MEGAN [3] and WebCARMA [4]. The query sequence itself was removed from the reference dataset when running the programs. The sensitivity of MetaPhyler is significantly higher than that of the other tools in all situations because our classifier is explicitly trained at each taxonomic rank.
In addition, we have created a simulated metagenomic sample comprising five genomes. Table 2 shows the taxonomic profiles estimated by different approaches. In this setting, MetaPhyler also outperforms the other approaches by more accurately reconstructing the true taxonomic distribution.
Conclusions We have introduced a novel taxonomic classification method for analyzing the microbial diversity from whole metagenome shotgun sequences. Compared with previous approaches, MetaPhyler is more
Results We identified a family with a previously undescribed lethal X-linked disorder of infancy comprising a distinct combination of an aged appearance, craniofacial anomalies, hypotonia, global developmental delays, cryptorchidism, cardiac arrhythmia and cardiomyopathy. We used X-chromosome exon sequencing and a recently developed probabilistic disease-gene discovery algorithm to identify a missense variant in NAA10, which encodes the catalytic subunit of the major human amino-terminal acetyltransferase (NAT; also known as hNaa10p). More recently, we became aware that a parallel effort on a second unrelated family converged on the same variant. The absence of this variant in controls, the amino acid conservation of this region of the protein, the predicted disruptive change and the co-occurrence in two unrelated families with the same rare disorder suggest that this is the pathogenic mutation. We confirmed this by demonstrating that the mutant hNaa10p had significantly impaired biochemical activity, and we therefore conclude that a reduction in acetylation by hNaa10p causes this disease.
Conclusions This is one of the first uses of next-generation sequencing to identify the genetic basis of a previously unrecognized X-linked syndrome. It is also the first evidence of a human genetic disorder resulting from direct impairment of amino-terminal acetylation, one of the most common protein modifications in humans. We have also demonstrated that a probabilistic disease-gene discovery algorithm (VAAST) can readily identify and characterize the genetic basis of this syndrome.
regulon. This regulon encodes the main proteins responsible for the production of both compounds, and it has been found to be functional in only a small number of bacterial genomes, such as those of the Gammaproteobacteria (Anaerovibrio, Citrobacter, Enterobacter, Ilyobacter, Lactobacillus and Klebsiella), Firmicutes (Clostridium) and Deltaproteobacteria (Pelobacter). To expand our knowledge and direct further experimental studies on the dha regulon and on related genes involved in anaerobic glycerol metabolism, an extensive genomic analysis was performed to identify the regulon in other species belonging to the Bacteria and Archaea groups. Methods BLAST similarity searches using the dha genes from Klebsiella pneumoniae as the seed were conducted on the National Center for Biotechnology Information database of complete prokaryotic genomes (as of May 2011). Candidate genes were thus confirmed both by sequence similarity searches (using the BLASTP program) and domain analysis.

Evolutionary and functional relationships revealed by the dha regulon predicted by genomic context analysis
The concatenated tree was based on the alignment of the concatenated sequences of the five dha genes: these encode the three glycerol dehydratase subunits (large, medium and small, encoded by dhaB1, dhaB2 and dhaB3, respectively) and the two glycerol dehydratase-reactivation factor subunits (large and small, encoded by dhaF and dhaG, respectively). Maximum likelihood and neighbor-joining trees were generated using the MEGA program. Bootstrap support (resampled 1,000 times) was calculated, and strict consensus trees were constructed. Based on alignments of homologous sequences, conserved indels that are useful for understanding the origin of the genes of the dha regulon in the archaean Halalkalicoccus jeotgali were investigated.
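The concatenation step that precedes tree building can be sketched as follows. The trees themselves were generated in MEGA; this snippet only illustrates how per-gene alignments might be joined into one supermatrix. The function name, the input layout and the rule of keeping only species present in every gene alignment are assumptions made for this example, and the toy sequences are invented.

```python
def concatenate_alignments(alignments, gene_order):
    """Join per-gene alignments into one concatenated alignment.

    alignments: dict mapping gene name -> {species: aligned sequence}.
    gene_order: list giving the order in which genes are concatenated.
    Only species present in every gene alignment are kept (an assumption).
    """
    for gene in gene_order:
        lengths = {len(seq) for seq in alignments[gene].values()}
        if len(lengths) > 1:
            raise ValueError(f"sequences in {gene} alignment differ in length")

    shared = set.intersection(*(set(alignments[g]) for g in gene_order))
    return {
        species: "".join(alignments[g][species] for g in gene_order)
        for species in sorted(shared)
    }


# Toy alignments for two of the five dha genes (invented sequences).
toy = {
    "dhaB1": {"K_pneumoniae": "MKR-S", "H_jeotgali": "MKRAS", "C_freundii": "MKR-T"},
    "dhaB2": {"K_pneumoniae": "AQ-T", "H_jeotgali": "AQLT"},
}
print(concatenate_alignments(toy, ["dhaB1", "dhaB2"]))
```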
Results Comparative genomics revealed that the complete dha regulon has a restricted distribution in Bacteria, being found only in species from the Actinobacteria, Firmicutes, Fusobacteria, Gammaproteobacteria and Deltaproteobacteria. From more than 1,000 complete prokaryotic genomes analyzed, approximately 100 possess at least part of this regulon and belong to the groups mentioned above. The members of one group of Archaea (Halalkalicoccus) also carry part of this regulon. The function of this regulon has been characterized in only a few species of bacteria, but its wide distribution in these groups suggests that it may be of far greater importance than was previously recognized. Interestingly, part of this regulon, responsible for the production of HPA, is present in unique species from two groups of Bacteria and one from Archaea. In these three groups (Deltaproteobacteria, Alphaproteobacteria and Archaea), the genes dhaB1, dhaB2, dhaB3, dhaF and dhaG are present in only one species from each group. Moreover, these genes have similar GC content (59.3% and 60.6%), which is higher than that of the host genome. Interestingly, this regulon is found in a plasmid in the archaean H. jeotgali. Phylogenetic analysis reinforces the idea that these regulons have a common origin, because the genes are grouped together and so were possibly acquired by horizontal gene transfer. These genes were also analyzed for the presence of indels shared with the bacteria, but no distinctive indels were found.
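The GC-content comparison mentioned above rests on a simple calculation, sketched here under the assumption that ambiguous bases (such as N) are ignored. The function name and the toy sequences are illustrative; they are not the sequences analyzed in the study.

```python
def gc_content(sequence):
    """Return the GC content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N or gap characters, is ignored (an assumption of this sketch).
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# Toy comparison of a gene cluster against its host genome (invented data).
cluster_gc = gc_content("GGCCGCATGCGC")
host_gc = gc_content("ATATGCATTAGC")
print(f"cluster {cluster_gc:.1f}% vs host {host_gc:.1f}%")
```

A gene cluster whose GC content departs clearly from that of its host genome is one of the usual hints of horizontal acquisition, which is how the figure is used in the abstract.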
Conclusions Although in silico inferences should normally be confirmed and tested by experimentation, the present work provides a taxonomic distributional profile of the genes responsible for the anaerobic metabolism of glycerol in Bacteria and Archaea, providing new insight into the taxonomy and evolutionary history of this regulon. Curiously, some of these organisms live with only part of the regulon. In addition, this study also provides a useful framework for further investigations of the functions of these genes.
Background Thiol peroxidases have been conserved throughout evolution and are found in almost every known organism from bacteria to humans. These proteins play a key role in maintaining redox homeostasis and have been implicated in other processes such as cell signaling and sensing hydrogen peroxide and passing this signal along to transcription factors. To gain a better understanding of the role that each thiol peroxidase plays in redox regulation on a global level, Fomenko and colleagues [1] performed a series of microarray experiments in which different combinations of the genes encoding the eight thiol peroxidases (three glutathione peroxidase homologs (Gpx) and five peroxiredoxins (Prx)) present in yeast were knocked out, including one mutant (8-Δ) in which all eight peroxidases were removed. Surprisingly, all of the mutants, including 8-Δ, were viable and could withstand redox stresses; however, they were unable to activate or repress transcriptional events in response to hydrogen peroxide treatment, which was most evident in the 8-Δ mutant. In our work, network analysis was used to gain a better understanding of the biological networks whose gene expression is affected by these mutations. Methods Microarray data (provided by [1]) was processed for input into the Cytoscape plug-in jActiveModules.
Active sub-networks for select mutants were identified using all yeast interactions found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [2] as the background network (including protein-protein, metabolic and gene expression interactions). Nodes in each sub-network were input into the Database for Annotation, Visualization and Integrated Discovery (DAVID) [3] to identify which KEGG pathways were present. Results Two hundred and six genes appeared in one or more of the active sub-networks. Only seven genes were present in the sub-networks of all strains. These were a known oxidative stress-induced aldose reductase (GRE3), four putative aryl-alcohol dehydrogenases (AAD3, AAD6, AAD10 and AAD14), a mitochondrial aldehyde dehydrogenase (ALD4) and a xylulokinase (XKS1). All of the genes were upregulated on average by 6- to 12-fold in all strains, except for 8-Δ with a 1.5-fold average upregulation and 5Prx-Δ with a 3-fold average upregulation. Many metabolic pathways were affected by the knockouts; the pathway types affected depended on which peroxidase gene was knocked out. This result suggests that different thiol peroxidases may have a significant and specific impact on the regulation of metabolic pathways during oxidative stress. Surprisingly, the Gpx3-Δ active sub-network was similar to the Gpx1-Δ and Gpx2-Δ sub-networks. Gpx3 is known to sense hydrogen peroxide and pass that signal along to transcription factors; thus, it was expected that this sub-network would differ from that of the other Gpx mutants. Additionally, our results showed that amino acid metabolism, biosynthesis and degradation pathways were active in wild-type cells but were present in few mutant strains.
The UGT1A locus encodes nine UGT proteins, which belong to the phase II cellular detoxification system. UGTs are functionally important for the detoxification of aromatic amines, which are found in industrial chemicals and tobacco smoke and are known risk factors for bladder cancer.
The UGT-encoding genes have exons 2 to 5 in common but have different first exons, which define the enzymatic activity and substrate specificity of the gene products.

Methods and results
We sequenced all nine highly similar alternative first exons for the UGT-encoding genes of up to 2,000 individuals. We identified 26 known nonsynonymous and 17 known synonymous coding variants but no novel variants. Imputation based on the GWAS dataset, a combined reference panel of HapMap 3 and the 1000 Genomes Project, and a subset of GWAS samples genotyped for all of the identified coding variants generated data for 1,170 SNPs within the whole UGT1A region. Of these markers, the strongest association was detected for an uncommon protective genetic variant that explained the original GWAS signal (odds ratio (OR) = 0.55, 95% confidence interval (CI) = 0.44 to 0.69, P = 3.3 × 10^−7 in 4,035 cases and 5,284 controls; D' = 0.96, r^2 = 0.23 with rs11892031). No residual association in this region was detected after adjustment for this SNP. A typical genetic variant identified by GWAS for a common disease is expected to be a common allele (>10% minor allele frequency) that increases the disease risk. We show that the novel associated variant is an uncommon protective allele (1.14% in cases and 2.5% in controls). Interestingly, the risk allele (G) is conserved in 33 species, whereas the protective allele (T) is a human-specific variant. Even though this SNP is a synonymous coding variant, we show its association with quantitative mRNA expression of a specific functional splicing form of UGT1A6, probably through an exonic splicing enhancer. Conclusions This study exemplifies that uncommon protective genetic variants are unusual suspects that may play important but underestimated functional roles in complex traits.
Background Horizontal gene transfers (HGTs) are pervasive in prokaryotes [1], being the routes of net-like evolution that collectively dominate the evolution of prokaryotes [2]. However, in eukaryotes, the effect of HGT has not been thoroughly analyzed, with the exception of the massive HGT from the endosymbionts [3].
Here, we report a comprehensive analysis of likely HGT events in different groups of unikonts (Amoebozoa, Archamoebae, Mycetozoa, the Fungi/Metazoa group, Choanoflagellida, Fungi and Metazoa).
Methods We analyzed the complete proteomes of 36 species of unikonts: 1 from the Archamoebae, 1 from Mycetozoa, 18 from Fungi, 13 from Metazoa and 1 from Choanoflagellida. These proteomes were manually selected to widely represent the unikont supergroup. Initial pre-candidate genes were obtained by analyzing each proteome using the DarkHorse program [4]. The program BLASTClust was then used to make clusters of putative unique transfer events at the origin of the different groups of unikonts. These clusters were separated into two groups: group I candidate clusters (clusters with no eukaryotic representative other than the unikont group analyzed), and group II candidate clusters (clusters with representatives from prokaryotes, the unikont group analyzed and other eukaryotes). Sequences from group I candidate clusters were analyzed using BLAST versus nr and RefSeq databases, compared with the clusters of orthologous groups for eukaryotic complete genomes (KOGs) [5] and manually curated to remove false positives that result from bacterial contamination of the genomic DNA. Group II candidate clusters were analyzed using a series of automatic, conservative filters to assess the quality of the candidates. Finally, all clusters were phylogenetically analyzed to define the final candidates and to infer putative donors. Results Using this methodology, we detected numerous probable HGT events from prokaryotes (mainly Bacteria) to unikonts. These events are not distributed uniformly throughout the evolution of unikonts: for example, almost all HGTs detected in Amoebozoa occurred after the divergence of Archamoebae and Mycetozoa. Importantly, we also detected many HGT events from Bacteria to Fungi, Choanoflagellida and Metazoa. Conclusions Although HGTs are not as pervasive in eukaryotes as in prokaryotes, the amount of HGT detected in this study suggests that the acquisition of genes from Bacteria played a major role in the evolution of the unikonts.
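The split of candidate clusters into group I and group II described in the Methods can be expressed as a small decision rule. The sketch below is one reading of that description; the function name, the `(domain, group)` tuple layout and the example clusters are assumptions introduced for illustration and are not taken from the study's pipeline.

```python
def classify_cluster(members, focal_group):
    """Assign a candidate cluster to group I, group II or neither.

    members: iterable of (domain, group) tuples, where domain is
             "prokaryote" or "eukaryote" and group is a taxon name.
    focal_group: the unikont group under analysis (for example "Fungi").

    Group I:  prokaryotes plus the focal group, no other eukaryotes.
    Group II: prokaryotes, the focal group and other eukaryotes.
    """
    members = list(members)
    has_prokaryote = any(domain == "prokaryote" for domain, _ in members)
    has_focal = any(
        domain == "eukaryote" and group == focal_group for domain, group in members
    )
    has_other_eukaryote = any(
        domain == "eukaryote" and group != focal_group for domain, group in members
    )
    if not (has_prokaryote and has_focal):
        return None  # not a candidate for prokaryote-to-focal-group transfer
    return "II" if has_other_eukaryote else "I"


# Invented example clusters.
cluster_a = [("prokaryote", "Bacteria"), ("eukaryote", "Fungi")]
cluster_b = cluster_a + [("eukaryote", "Viridiplantae")]
cluster_c = [("eukaryote", "Fungi"), ("eukaryote", "Metazoa")]
for cluster in (cluster_a, cluster_b, cluster_c):
    print(classify_cluster(cluster, "Fungi"))
```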
and enhance its survival. We believe that these networks involve genes that tend to be coordinated in their copy number alterations, even when they are located at a distance in the genome. Radiation hybrid (RH) cells have a random assortment of genes as triploid rather than diploid. Our recent work studying genetic networks in libraries of RH cells has elucidated key survival-enhancing interactions with high specificity [1]. Because of the hardiness of the RH clones, statistically significant patterns of co-inherited, unlinked triploid gene pairs pointed to the cell survival mechanism. We identified more than 7.2 million significant interactions at single-gene resolution using the RH data. Methods Our work with the RH data provided the rationale for an investigation of cancer survival networks, in particular for glioblastoma multiforme, a formidable brain cancer for which extensive datasets are available but few treatment options. We investigated correlated patterns of copy number alterations for distant genes in glioblastoma multiforme tumors using the same method we employed to construct the RH survival network. Public data were analyzed from 301 glioblastomas that had been assessed for copy number alterations using array comparative genomic hybridization [2]. Results The glioblastoma and RH survival networks overlapped significantly (P = 3.7 × 10^−31). We therefore exploited the high-resolution mapping of the RH data to obtain single-gene specificity in the glioblastoma network. The combined network features 5,439 genes and 13,846 interactions (false discovery rate (FDR) <5%) and suggests novel approaches to therapy for glioblastoma. For example, although the epidermal growth-factor receptor (EGFR) oncogene is frequently activated in glioblastoma, EGFR inhibitors have limited therapeutic efficacy [3]. In the combined glioblastoma survival network, there are 46 genes that interact with EGFR, of which ten (22%) happen to be targets of existing drugs.
This observation suggests that a flanking attack strategy that strikes at both EGFR and its partner genes in the glioblastoma survival network may be an effective approach to treating these tumors. Conclusions By elucidating a genetic survival network for glioblastoma, we gained insight into the mechanisms of proliferation of this cancer and opened up new avenues for therapeutic intervention.
Background Hundreds of diverse genetic loci have been linked to autism spectrum disorders (ASDs), making large-scale analysis essential for understanding the molecular events underlying the pathogenesis of these disorders. Our laboratory first released the autism database AutDB in 2007 as a bioinformatics tool for systematic curation of all known ASD candidate genes [1][2][3]. AutDB was designed with a systems biology approach, integrating genetic entries within the Human Gene module with corresponding behavioral, anatomical and physiological data in the Animal Model module. In June 2011, we released a new Protein Interaction (PIN) module of AutDB, which serves as a comprehensive, up-to-date resource on the direct protein interactions of ASD-linked genes.
Methods To curate the PIN module, our researchers utilize a multi-level annotation model to systematically search, collect and extract information entirely from published, peer-reviewed scientific literature. Although we initially consult public molecular interaction databases (HPRD and BioGRID) and commercial molecular interaction software (Pathway Studio, version 7.1), every interaction is manually extracted and verified by evaluating the primary reference articles from PubMed. Our manual curation has proved critical for accurate annotation, because these references were the second largest source of references for the initial PIN dataset, providing more interactions than both HPRD and Pathway Studio. Each ASD gene entry within the PIN module is presented as a multi-level display, with interactive graphical and tabular views of its corresponding interactome.

Results
The initial PIN dataset includes interactomes for 86 ASD candidate genes, with a total of 1,311 direct protein interactions garnered from 533 unique primary references. These interactomes are composed of 6 interaction types and 13 species, documented by 402 distinct pieces of evidence. Our researchers will expand and maintain the data content of the PIN module with systematic updates.
Conclusions We have created an integrated bioinformatics tool that can be used for the large-scale analysis of the biological relationships among ASD candidate genes. Such network analysis is envisioned to provide a framework for identifying the key molecular pathways underlying ASD pathogenesis, potentially leading to the development of novel drug therapies.
Background Bladder cancer is the 9th most common cancer worldwide and the 13th most common cancer-related cause of death. Bladder cancer frequently recurs after the removal of primary carcinomas. This recurrence leads to repeated surgeries and long-term treatment and surveillance, making it the most expensive type of cancer to treat. Genetic factors and environmental factors such as cigarette smoking and occupational exposure to aromatic amines are linked to bladder cancer risk. Genome-wide association studies (GWAS) for bladder cancer have identified multiple genetic variants within genes and regions, including TP63, TERT-CLPTM1L and 8q24.21, to be highly associated with disease risk. Whole transcriptome sequencing (RNA-Seq) is a revolutionary tool for generating a large amount of qualitative and quantitative information, thus helping to explore known and novel transcripts, splicing forms and fusion genes. Methods To understand the genetic and genomic landscape of the GWAS susceptibility regions, we investigated and characterized the entire transcriptome of normal and tumor bladder tissue samples by using powerful massively parallel RNA sequencing. We used an Illumina HiSeq 2000 instrument to sequence six paired samples of normal and tumor bladder tissues. For each of the samples, we generated 50 Gb of 100-bp reads to represent the whole transcriptome. Results Using the Bowtie/TopHat and Samtools packages, we successfully aligned approximately 80% of the total sequence reads against the human genome reference sequence (build 19).
Our analysis sought to identify alternative splicing forms, novel exons, non-coding transcripts and chimeric fusion events. Total levels of mRNA in normal and tumor samples were evaluated by Cufflinks analysis based on the Ensembl transcripts database. Multiple splicing isoforms were identified for some of the GWAS susceptibility genes, and some of these isoforms were differentially expressed between the tumor and normal samples. We found that novel transcripts and non-coding RNAs corresponding to gene desert regions such as 8q24 were abundantly expressed. Our next step will focus on validation of these differentially expressed genes and novel transcripts by using quantitative RT-PCR on independent samples. Conclusions Using RNA-Seq, we explored transcripts corresponding to candidate regions identified by bladder cancer GWAS. Some of these transcripts demonstrated splicing variability and differential levels of expression between normal and tumor tissue samples, which might be of importance for bladder cancer.
Background Recent genome-wide association studies (GWAS) have identified multiple genetic variants associated with the risk of developing prostate cancer (PrCa). At least ten PrCa-associated single nucleotide polymorphisms (SNPs) are located within a gene-poor region on chromosome 8q24, but the functional mechanisms of each of these variants remain unknown. Normal prostate development, as well as tumor initiation and progression, greatly depends on the androgen receptor (AR) and its ligands, testosterone and 5α-dihydrotestosterone. We hypothesized that genetic variants associated with PrCa risk might be important owing to their effects on AR-binding sites.

Methods and results
We comprehensively explored 11 PrCa GWAS published as of July 2011 in the National Human Genome Research Institute's GWAS database [1] and in PubMed [2]. We selected ten SNPs from the 8q24 region that were significantly and consistently associated with PrCa in Caucasian datasets (P < 5 × 10^−7). By querying the CEU 1000 Genomes Project panel, we generated a list of 224 SNPs in high linkage disequilibrium (r^2 > 0.8) with the ten selected GWAS SNPs. Of all of the SNPs on this list, six variants were located in the regions identified as AR-binding sites, based on AR chromatin immunoprecipitation (ChIP)-Seq data from the University of California, Santa Cruz's genome browser [3]. To test for differential binding of AR to alleles of the six SNPs, we developed a protocol for quantitative multiplex allele-specific ChIP (AS-ChIP) assays. Confirmatory AS-ChIP with AR-specific antibodies in the LNCaP cell line showed that five of these SNPs were heterozygous in the LNCaP cell line, and four of them showed statistically significant allele-specific differences in AR binding (P-value range = 0.0005 to 0.04, based on four biological replicates of AS-ChIP). Conclusions Our data suggest that some of the PrCa-associated SNPs within the 8q24 region might create or disrupt binding sites for AR, thereby affecting important regulatory networks in normal and cancerous prostate tissue.
Background Metagenomics has opened the door to unprecedented comparative and ecological studies of microbial communities, ranging from the sea [1] to the soil (the terragenome) to within the human body [2,3]. Most analyses begin with assembly, as the short reads that are characteristic of most datasets severely limit the ability to classify the data taxonomically [4][5][6][7] and require considerable computational resources to perform comparative analyses (such as BLAST against public databases).
In addition, given that many sequences are likely to be from novel organisms, classification methods relying on databases fail to acknowledge most of the novel species present in the dataset. In an attempt to move away from reference-based analysis, computational tools based on promising algorithmic and statistical methods for metagenomic de novo assembly have recently started to emerge [8,9]. However, to date, they either are ill-suited to large datasets or have yet to offer significant improvements over existing genome assemblers that were not designed for metagenomic assembly.

Methods
Here, we describe MetAMOS [10], an open-source, modular assembly pipeline built upon AMOS and tailored specifically for metagenomic next-generation sequencing data. MetAMOS is the first step toward a fully automated assembly and analysis pipeline, from mated reads (Illumina and 454) to scaffolds and ORFs. Currently, MetAMOS has support for four assemblers (SOAPdenovo [11], Newbler, CABOG and Minimus [12]), three annotation methods (BLAST, PhymmBL and MetaPhyler), two metagenomic gene prediction tools (MetaGeneMark and Glimmer-MG) and one unitig scaffolder engineered specifically for metagenomic data (Bambus 2). We also provide a novel graph-based algorithm to propagate annotations rapidly to all contigs in an assembly using, for example, only the largest contigs or contigs with high-confidence classification. MetAMOS has three principal outputs: subdirectories containing FASTA sequence of the contigs/scaffolds/variant motifs belonging to a specified taxonomic level, a collection of all unclassified/potentially novel contigs contained in the assembly, and an HTML report with detailed assembly statistics and summary charts.
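The abstract does not spell out the graph-based propagation algorithm, so the following is only a generic sketch of the underlying idea: starting from a few confidently classified contigs, labels spread along the links of the contig graph by breadth-first search. The function name, the adjacency-list input and the tie-breaking rule (the first label to reach a contig wins) are all assumptions made for this example, not the MetAMOS implementation.

```python
from collections import deque


def propagate_annotations(graph, seed_labels):
    """Spread taxonomic labels from seed contigs to their graph neighbours.

    graph: dict mapping contig id -> iterable of linked contig ids
           (for example, contigs joined by mate pairs in a scaffold graph).
    seed_labels: dict mapping contig id -> label for the trusted seeds.
    Returns a dict of labels for every contig reachable from a seed;
    unreachable contigs are left out and remain unclassified.
    """
    labels = dict(seed_labels)
    queue = deque(sorted(seed_labels))  # sorted for deterministic output
    while queue:
        contig = queue.popleft()
        for neighbour in sorted(graph.get(contig, ())):
            if neighbour not in labels:
                labels[neighbour] = labels[contig]
                queue.append(neighbour)
    return labels


# Toy contig graph: a chain c1-c2-c3-c4-c5 and an isolated contig c6.
contig_graph = {
    "c1": ["c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2", "c4"],
    "c4": ["c3", "c5"],
    "c5": ["c4"],
    "c6": [],
}
seeds = {"c1": "Bacteroides", "c5": "Prevotella"}
print(propagate_annotations(contig_graph, seeds))
```

Each contig and each link is visited at most once, so the cost grows linearly with the size of the graph, which is what makes this kind of propagation fast compared with classifying every contig against a database.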

Results and conclusions
We compared MetAMOS with other metagenomic assembly tools (Meta-IDBA and Genovo) and with genome assemblers that have previously been used with metagenomic data (CA-met and SOAPdenovo). We used both a mock/artificial dataset generated for the Human Microbiome Project (HMP) and real metagenomic samples from the HMP and its European counterpart (MetaHIT). On the mock dataset, MetAMOS compares favorably to existing metagenomic and genomic assemblers with respect to several validation metrics that take into account contig accuracy in addition to size. On the real dataset, MetAMOS also outperforms the existing software. These improvements can largely be attributed to heavy reliance on Bambus 2 and to assembly verification techniques that help identify and remove potentially chimeric contigs while running the pipeline.
In terms of biology, we were able to report several novel variant motifs that would be challenging at best to identify and extract from the output of other methods. In addition, much emphasis was placed on making MetAMOS compatible with a variety of next-generation sequencing technologies, genome assemblers and annotation methods, making the pipeline highly customizable for the beginner and advanced bioinformatics user alike.
Gaucher disease is the most common lysosomal storage disorder. It results from an inherited deficiency of the enzyme glucocerebrosidase (GBA); accumulation of the substrate of this enzyme has many clinical manifestations.

The mutation spectrum in Indian patients with Gaucher disease
Since the discovery of the GBA gene, more than 200 mutations have been identifi ed, but only a handful of mutations are recurrent (L444P, N370S, IVS2, D409H and 55Del). To determine the spectrum of mutations in the Indian population, we performed mutational screening in children with Gaucher disease. Twenty-four patients from twenty families were enrolled in this study, after written informed consent was obtained. The diagnosis of Gaucher disease was based on mandatory clinical and biochemical analysis. An initial screening for fi ve common mutations was carried out using PCR-RFLP. Patients who were negative for common mutations were screened by sequencing exons 9 to 11 (a mutation hotspot region) [1]. We identifi ed common mutations (L444P, N370S, IVS2 and D409H [2], and 55Del [3]) in approximately 50% of the patients. L444P (c.1448T>C) was the most frequently identifi ed, followed by D409H in our patients. Western data shows that N370S is the most common mutation in Romanian patients [4]. One polymorphism (E340K) was identifi ed in two patients who were compound heterozygotes for A456P/R463C and S237F/A269P, respectively. Our data highlight the spectrum of mutations that lead to Gaucher disease in the Indian population. Background Given diff erential gene expression data across divergent mutant strain arrays of two enzyme subgroups, it would be logical to segregate by protein group ablation (PGA). Discrete correlate summation (DCΣ) was utilized to examine the diff erential eff ects of a hydrogen peroxide stressor on discrete and total yeast knockouts of the genes encoding glutathione peroxidase (Gpx) and peroxiredoxin (Prx), both groups starting from the wild-type (WT) strain [1]. While the half-life of the total Gpx knockout mutant is intermediate between that of the WT and the transient total Prx knockout mutant, the distribution of passage number of the various mutant strains can be separated into two groups independent of Gpx and Prx state. 
Based on half-viability, totalPrx <<<< nPrx << Gpx3 = Tsa1 < totalGpx < mPrx <<< Gpx1 < Gpx2 << Ahp1 = WT <<< Tsa2 (P < 0.0005, two-tailed t-test, n = 5, 6). DCΣ was also employed for the boundary between robust and gracile cultures. The aim of this study was to find the characteristic response of the transcriptome, from the perspective of PGA versus strain viability (SV). Methods DCΣ is a method used to score variables that can be classified into two groups [2]. It is a composite score of a gene's mean group change and overall interaction difference relative to all others tested. Transcripts were included in this analysis only if the values for all conditions passed microarray quality control and were present in the Kyoto Encyclopedia of Genes and Genomes (KEGG) network [3]. Randomly sorted edges were sampled for comparison (P < 0.001, two-tailed t-test, n = 8,372). Edges that were sorted on average DCΣ score and grouped by biological process yielded a distinctive topology (P < 1e-85, two-tailed t-test, n = 8,372). The identified transcripts were subjected to functional annotation in the Database for Annotation, Visualization and Integrated Discovery (DAVID) [4].

Boundary distinction interpretation of microarray data via discrete correlate summation
Results Application of DCΣ to the individual and complete knockouts of Gpx (3 genes) and Prx (5 genes) identified 92 transcripts based on PGA and 43 based on SV, with a 13 gene overlap (corresponding to the proteins Arg1p, Aah1p, Ade17p, Pgm2p, Cat2p, Cdd1p, Mae1p, Arg3p, Nma2p, Ole1p, Cta1p, Spb1p and Cds1p). Functional annotation analysis of the 92 PGA transcripts identified the following functions: pyrimidine metabolism, steroid biosynthesis, purine metabolism, RNA polymerase and terpenoid backbone biosynthesis. Ergosterol biosynthesis, gluconeogenesis and transcription from Pol I/III promoters were major biological process categories for this set. Interestingly, terpenoids feed into the steroid pathway, which results in the vitamin D2 precursor ergosterol. Analysis of the 43 SV transcripts identified starch and sucrose metabolism, butanoate metabolism, and fructose and mannose metabolism. Stress response was the key biological process for this arm of the study. No functional annotations were statistically significant for the common genes. Transcripts identified by PGA of either the Gpx- or Prx-encoding genes tend toward transcriptional control mechanisms, whereas SV-associated transcripts track with metabolic necessities.

A computational approach to identify transposable element insertions in cancer cells
Israel T Silva 1,2 , Daniel G Pinheiro 1 and Wilson A Silva Jr 1,3

Background Transposable elements (TEs) in the human genome may contribute to molecular evolution, hereditary diseases and cancer [1][2][3]. Therefore, analyzing the impact of TEs in the genome is necessary to better characterize genetic events related to tumorigenesis. Here, we used a computational approach to identify TE insertions in publicly available data for exome sequences in lymphoblastoid and breast tumor cells derived from the same patient. Methods A total of 29,340,378 sequences from the cell lines HCC1954 (18,365,271) and HCC1954BL (10,975,107) were used to investigate gene fusion with TEs (gfTEs) [4,5]. The RepeatMasker and Burrows-Wheeler Alignment (BWA) tools were used to identify and to map gfTEs, respectively. We also used BEDTools to find overlaps between gfTEs and genome annotations. Human mRNAs and RepeatMasker tracks were downloaded in BED format from the GRCh37/hg19 assembly. Repbase was used to filter the eukaryotic TEs.

Results
RepeatMasker was used to identify gfTEs in the exome reads. Next, the repeat-masked reads were aligned against the reference genome using BWA. Finally, we filtered the aligned reads to exclude those without TEs (length of Ns <15, where Ns denotes a block of masked nucleotides), those with alignments showing low sequence identity (<95%) or those with a small hit length (<50 nucleotides). The study focused on the detection of TEs in coding sequence gene regions. A total of 3,307,608 reads were excluded, and 23,841 reads were predicted as cancer-specific gfTEs. Table 1 shows the number of gfTEs distributed among the TE families and highlights the members with higher frequency in both cell lines. Insertions of LINE/L1 and SINE/Alu were the most frequent. The Gene Ontology analysis for the biological process and molecular function terms showed a bias toward membrane receptor and cell adhesion proteins. Conclusions We used a computational approach to identify putative cancer-specific gfTEs using human exome capture sequences. Interestingly, the total number of gfTEs was similar in normal and tumor cell lines, but the Gene Ontology analysis revealed an enrichment of insertions in genes encoding protein receptors and cell adhesion molecules. These results suggest that TEs could be contributing to cancer development.
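The three read-level exclusion criteria described in the Results (masked block shorter than 15 nucleotides, identity below 95%, hit length below 50 nucleotides) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `AlignedRead` record, its field names and the function names are hypothetical, and the sketch assumes that "length of Ns" refers to the longest contiguous run of masked bases in a read.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical record for one repeat-masked, aligned read."""
    name: str
    masked_sequence: str     # read after masking; masked bases appear as 'N'
    percent_identity: float  # identity of the alignment to the reference (%)
    hit_length: int          # aligned length in nucleotides


def longest_masked_block(seq: str) -> int:
    """Length of the longest contiguous run of masked ('N') bases."""
    longest = current = 0
    for base in seq:
        if base in "Nn":
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def passes_filters(read: AlignedRead, min_masked: int = 15,
                   min_identity: float = 95.0, min_hit: int = 50) -> bool:
    """Keep a read only if it meets all three thresholds from the text."""
    return (longest_masked_block(read.masked_sequence) >= min_masked
            and read.percent_identity >= min_identity
            and read.hit_length >= min_hit)
```

A list of candidate reads could then be reduced with `[r for r in reads if passes_filters(r)]`; the thresholds are exposed as parameters so that other cut-offs can be tried.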
Background Genome-wide association studies (GWAS) of human complex disease have identified a large number of disease-associated genetic loci, which are distinguished by distinctive frequencies of specific single nucleotide polymorphisms (SNPs) in individuals with a particular disease. However, these data do not provide direct information on the biological basis of a disease or on the underlying mechanisms. Many studies have shown that variations in gene expression among individuals, as well as among cell types, contribute to phenotype diversity and disease susceptibility. Recent genome-wide expression quantitative trait loci (eQTL) association (GWEA) studies have provided information on genetic factors, especially SNPs, that are associated with gene expression variation. These expression-associated SNPs (exSNPs) have already been utilized to explain some results of GWAS for diseases, but interpretation of the data is handicapped by low reproducibility of the genotype-expression relationships. Methods To address this problem, we established several gold standard sets of high-reliability exSNPs based on multiple occurrences in different GWEA studies in various human populations and cell types. We then related these data to results from GWAS for diseases, to find a set of disease-associated loci that are likely to have an underlying expression mechanism. HapMap linkage disequilibrium data were utilized to allow the comparison of GWEA results from studies that employed different microarray SNP sets.
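The idea of building a gold standard set from exSNPs that recur across independent studies can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `min_studies` threshold and the representation of each study as a collection of (SNP, gene) pairs are hypothetical, and the sketch matches SNPs by identifier only, without the linkage disequilibrium step described in the Methods.

```python
from collections import defaultdict


def gold_standard(studies, min_studies=2):
    """Return the (snp_id, gene_id) pairs reported by at least `min_studies` studies.

    `studies` maps a study name to an iterable of (snp_id, gene_id) pairs.
    """
    support = defaultdict(set)
    for study_name, pairs in studies.items():
        for pair in pairs:
            # A set of study names avoids counting duplicates within one study.
            support[pair].add(study_name)
    return {pair for pair, names in support.items() if len(names) >= min_studies}
```

Raising `min_studies` trades sensitivity for reliability, which is the kind of choice that would produce several gold standard sets of differing stringency.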
Results We integrated the current gold standard data with SNPs in disease-associated loci from the Wellcome Trust Case-Control Consortium (WTCCC) GWAS of seven common human diseases. Approximately one-third of these disease-associated loci in the WTCCC GWAS were found to be consistent with an underlying expression change mechanism. Comparing separate gold standard sets for Caucasian (CEU), African (YRI) and Asian (ASN) populations also allowed us to investigate which exSNPs contribute to population-specific eQTLs. Conclusions Use of the gold standard set of SNP-expression relationships has enabled us to more reliably determine the role of expression changes in common human diseases.

Background Archaeal and bacterial ribosomes contain more than 50 proteins. Thirty-four ribosomal proteins (r-proteins) are universally conserved in the three domains of cellular life (Bacteria, Archaea and Eukarya), and 33 r-proteins are shared between Archaea and Eukarya to the exclusion of Bacteria; there are also 23 Bacteria-specific, 1 Archaea-specific and 11 Eukarya-specific r-proteins [1]. Despite the high sequence conservation of r-proteins, the annotation of r-protein genes is often difficult because of their short lengths and biased sequence composition. Methods To perform a comprehensive survey of prokaryotic r-proteins, we developed an automated computational pipeline for the identification of r-protein genes and applied it to 995 completely sequenced bacterial genomes and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, thus overcoming possible gene annotation errors.

Phylogenomics of prokaryotic ribosomal proteins
Likely false positives are identified using comparisons against the original seed alignments.
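Searching six-frame translations, rather than annotated genes, is what lets a pipeline of this kind sidestep annotation errors. The following minimal Python sketch shows only the six-frame translation step; it is not the authors' pipeline and does not perform the PSSM-based BLAST search. The function names are hypothetical, and the sketch assumes the standard codon assignments and translates any codon containing a non-ACGT character as 'X'.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon table built from the conventional TCAG ordering.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for sign, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames
```

Each translated frame could then be written out as a protein sequence and used as the search database, so that a gene missed or mis-annotated in the original genome record can still be found.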

Results
In the course of this analysis, we gained insight into the diversity of prokaryotic r-protein complements, such as missing and paralogous r-proteins and distributions of r-protein genes among chromosomal partitions. A phylogenetic tree was constructed from a concatenated alignment of 50 almost-ubiquitous bacterial r-proteins. The topology of the tree is generally compatible with the current high-level bacterial taxonomy, although we detected several inconsistencies, possibly indicating uncertain or erroneous classification of the respective bacteria. Similarly, a concatenated alignment of 57 ubiquitous archaeal proteins was used for an archaeal phylogenetic tree reconstruction. In both Bacteria and Archaea, the patterns of the presence/absence of non-ubiquitous r-proteins suggest several independent losses and/or gains of these proteins. According to parsimony reconstruction, three bacterial and five archaeal r-proteins do not appear to be ancestral. Remarkably, all five non-ancestral archaeal r-proteins are present in Eukarya.
Conclusions Extended sets of prokaryotic r-proteins were created. Alignments of these sets may be used as new seed profiles for the identification of r-proteins in new genomes and for comparative genomics studies.

Broad clinical application of ultra-high-throughput sequencing is imminent. In a few notable cases, actionable information has been discovered from sequencing, and the number of such cases is likely to increase. At present, there are no widely accepted genomic standards or quantitative performance metrics. These are needed to achieve the confidence in measurement results that is expected for sound, reproducible research and regulated applications. The National Institute of Standards and Technology (NIST) has been approached about considering development in this area by several commercial entities and regulatory agencies. There is great enthusiasm for translation of sequencing from the research community to clinical practice, and standards that can be used to inform confidence in measurement results (for instance, through validation studies, proficiency testing and routine quality assurance) may be an enabling factor in that goal. NIST is currently gathering input from the genomics community about which reference materials and data would be useful. For example, NIST and the Coriell Institute for Medical Research may develop genomic reference material from cell lines from families that have already been characterized by a variety of sequencing methods (for example, the cell line from which NA12878 DNA is derived). In addition, we may build synthetic DNA constructs to test specific questions about measuring different types of variants or combinations of variants in different genomic contexts. For example, we might create pairs of constructs with single nucleotide polymorphisms, indels and/or structural variants in GC- or AT-rich regions or repeat regions.
To ensure the design of appropriate standards, we are interested in discussing the design and application of genomic reference materials with any interested parties.

proteins responsible for successful formation of the mating plug [1]. This plug forms in the male and is transferred to the female during mating, hence initiating the PPIs in both sexes. As is the case in Drosophila melanogaster, a close relative of A. gambiae, some MAG proteins responsible for the formation of the mating plug have been shown to alter the post-mating behavior of females.

Methods and results
The STRING database for known PPIs was used to identify orthologs of A. gambiae proteins in Drosophila (Table 1). Twenty-seven proteins are known to form the mating plug in A. gambiae, and 16 others were obtained as strings in the STRING database. Chromosome synteny comparisons for proteins with more than 50% identity between species were carried out using the Artemis Comparison Tool (ACT version 9.0), and this showed 24.39% matches (M), 12.20% mismatches (MM) and 63.41% unmatched (NM). The network built in Cytoscape (version 2.8.0) with the UniProt IDs for these Drosophila orthologs showed 14 complexes, with 4 of them being for Drosophila. The network showed 555 nodes and 2,344 edges. The top 50 identified hubs in the network showed a range of 3 to 30 interactions. The expression values for these proteins in FlyAtlas showed that they are upregulated in the reproductive tissues of both sexes. To understand the processes involved in plug formation, the Reactome database was used, and the hub proteins were identified in 49 of the 2,021 known processes in Drosophila. Twelve proteins were involved in the following processes: metabolism of proteins (8.8e-13), gene expression (2.0e-06), 3'-UTR-mediated translational regulation (7.7e-08), regulation of β-cell development (1.3e-06), diabetes pathways (6.8e-06), signal recognition (preprolactin) (5.0e-07) and membrane trafficking (1.3e-03). Of the top 50 proteins, 92% had orthologs in A. gambiae, with one identified in the mating plug and four others identified as strings to AGAP009584, which is found in the mating plug. Acp29AB was identified in the network and is known to induce post-mating responses in Drosophila, confirming that the network is reproductive and giving an insight into the possible pathways involved. The CG9083 (Q8SX59) protein was ranked first among the hub proteins but has no ortholog in A. gambiae. Interestingly, it has the same protein properties as the Plugin protein (AGAP009368) in A. gambiae, suggesting that Plugin may be the main protein in the PPI reproductive network in A. gambiae. The Whelan and Goldman (WAG) maximum likelihood tree evaluations of the plug proteins in A. gambiae and their orthologs in Drosophila showed that these proteins are involved in similar biological processes in both species, but the A. gambiae protein evaluation provided a better explanation for the expected process as it clustered in both pre-mated and post-mated PPIs.

This table shows the 27 proteins known to be in the mating plug of A. gambiae [1], derived predominantly from the male. The 16 strings predicted

There are more than 20,000 genomic studies comprising 500,000 samples freely available in the Gene Expression Omnibus (GEO) database [1]. However, accessing these data requires complex computational steps, including structuring and formatting the clinical vocabulary used to annotate the samples. These complex steps hinder the accessibility of genomic datasets through visualization and analysis software platforms, such as GenePattern and R/Bioconductor, therefore hampering the pace of research. InSilico DB [2] is an online platform that provides a complete collaborative solution for structuring and formatting clinical annotations from GEO, making GenePattern and R datasets one click away for researchers. InSilico DB has made available powerful and intuitive online curation tools to structure the metadata of GEO datasets. The database is automatically updated daily, through GEO import pipelines. Datasets can have multiple annotations given by different users, and one user can have multiple versions of an annotation to suit different experimental questions. The InSilico DB platform supports datasets from Affymetrix human gene expression platforms, which account for 2,900 studies comprising 110,000 samples, making InSilico DB the largest public database of manually curated human gene expression samples. 
In addition to the web interface, InSilico DB offers programmatic access through an R/Bioconductor package [3]. Future releases of InSilico DB will include Illumina RNA-Seq platform data and Affymetrix mouse gene expression data.
Background Research focused on genome-wide association studies (GWAS) has resulted in the identification of genetic variants associated with risk of developing breast cancer. These genetic variants are providing valuable insight into the genetic susceptibility landscape of breast cancer. However, to date, data generated from GWAS have not been maximally leveraged and integrated with gene expression data to identify the genes and pathways associated with the most aggressive subset of breast cancers, triple-negative breast cancer (TNBC), which accounts for about 20% of all breast cancers. TNBC disproportionately affects young premenopausal women and has a higher mortality rate among African-American women. At present, no targeted treatments exist for TNBC, and standard chemotherapy remains the only therapeutic option. Integration of genetic mapping results from GWAS with gene expression data could lead to a better understanding of the genetic mechanisms underlying the molecular basis of the TNBC phenotype and to the identification of potential biomarkers for the development of novel therapeutic strategies. Methods We mined data from 43 GWAS involving over 250,000 patients with breast cancer and 250,000 controls, reported through April 2011, to identify genetic variants (single nucleotide polymorphisms (SNPs)) and genes associated with risk for breast cancer. We then integrated GWAS information with gene expression data from 305 subjects (162 cases and 143 controls) to stratify TNBC and other breast cancer subtypes, as well as to identify functionally related genes and multi-gene pathways enriched by SNPs that are associated with risk for breast cancer and are relevant to TNBC. To stratify TNBC and to identify functionally related genes, we performed supervised and unsupervised analysis of gene expression data. We used a false discovery rate to correct for multiple testing. 
Pathway prediction and network visualization were performed using Ingenuity Systems' software.
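The Methods state that a false discovery rate was used to correct for multiple testing but do not name the procedure. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg adjustment in plain Python; choosing this procedure is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value falls below a chosen threshold (for example 0.05) would then be reported as significant at that false discovery rate.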

Results
Combining GWAS information with gene expression data, we identified 448 functionally related genes that stratified breast cancer subtypes into TNBC. A subset of these genes (130 genes) contained SNPs associated with risk for breast cancer; of these 130 genes, 122 correctly stratified TNBC. Pathway prediction revealed multi-gene pathways enriched by SNPs that are significantly associated with risk for breast cancer. Key pathways identified include the p53, nuclear factor-κB, DNA repair and cell cycle regulation pathways.
Conclusions Our results demonstrate that integrating GWAS information with gene expression data can be an effective approach for identifying biological pathways that are relevant to TNBC. These could be potential targets for the development of novel therapeutic strategies.

P36
Novel purification reagents for the study of the human microbiome

Nucleic-acid-based techniques such as hybridization, PCR, quantitative PCR and next-generation sequencing offer a rapid and highly sensitive alternative to culture-based techniques for detecting pathogenic bacteria directly in specimens. However, in addition to the inherent limitations of amplification and identification of biological samples, the human genome itself may interfere with pathogen detection and diagnosis, owing in part to the higher proportion of human genomic DNA that is present relative to the target microbiome. As a consequence, analyses of a metagenome or microbiome from clinical samples using next-generation sequencing or PCR are inefficient, difficult and time-consuming. We have developed a unique method for the separation of large pieces of human DNA (about 20 kb) from similar sizes of Escherichia coli DNA (about 20 kb) using methyl-binding protein 2a (MBD2a) fused to the Fc portion of a human antibody heavy chain (MBD2a-Fc). This MBD2a-Fc protein is expected to bind to a hydrophobic protein-A-coated magnetic bead that has been engineered to show little or no non-specific DNA binding. The bacterial/human DNA is added to the beads and incubated for 15 minutes, and then a magnetic field is applied to capture MBD2a-Fc matrix bound to methylated DNA (for example, human DNA). The majority (about 95%) of the human DNA, which is CpG methylated, is expected to remain bound to the magnetic bead matrix, whereas the bacterial DNA, which is generally CpG methylation free, remains in the supernatant. The recovery rate of the input microbial DNA is >95%. The supernatant, which is enriched for microbial DNA, is fractionated by sonication or restriction enzyme digestion. A second reagent, CXXC-Fc, an unmethylated CpG moiety binding protein, further concentrates the bacterial DNA fraction. 
We have established methodologies whereby these reagents can be used to decode entire microbiomes in a cost-effective manner using existing next-generation sequencing platforms and newer single-molecule sequencing technologies. Sequence analysis from Ion Torrent data will be presented. The availability of large microbiome datasets will allow us to identify unique markers specific to bacterial species, as well as single nucleotide polymorphism targets that are associated with normal and disease states. Additionally, we plan to develop a rapid in situ solid phase platform using the above reagents to lyse, enzymatically fractionate, purify and concentrate minuscule quantities of bacterial DNA from blood or other bodily fluids from mammalian species. We believe these new purification reagents will have a major impact on the broader biomedical community interested in studying the human microbiome and host-pathogen interactions.

The clinical reality of the post-genomic era is that we now face even more complex disease processes when provided with genomic information, including multifactorial genetic and genomic influences, and epigenetic and environmental factors. A useful example of the promise and perils of genomic technologies and information is breast cancer. By the mid-1990s, two genes (BRCA1 and BRCA2) had been identified, accounting for approximately 5% of affected individuals. Since then, surprisingly few genetic breast cancer risk factors have been identified to account for the remaining 95%. To efficiently and cost-effectively identify individuals at high risk, a combination of information components is required: a patient-reported personal and family medical history; clinical data (for example, a physical exam, pathology results, laboratory test results and imaging); and genetic/genomic results. Gaining comprehensive data from all of these areas provides the best risk assessment and management options for patients. 
Furthermore, high quality patient and clinical information is essential for the accurate and reliable interpretation of genomic results.

An amalgamated risk estimation model (REM) and assay integration into future REMs
We have clinically implemented a platform that integrates all three informational components with multiple risk estimation models (REMs) to produce an effective automated method for risk-stratifying patients.
Although this platform can be and has been applied to a wide range of genetic conditions, this presentation will use breast cancer to illustrate the approach. This system consists of three primary components: a secure web-based questionnaire used by patients to enter personal and family medical history; a tablet-based system for collecting clinical and genomic information; and an analysis engine that seamlessly integrates

The new and emerging field of systems medicine, an application of systems biology approaches to biomedical problems in the clinical setting, leverages complex computational tools and high dimensional data to derive personalized assessments of disease risk. Systems medicine offers the potential for more effective individualized diagnosis, prognosis and treatment options. The Georgetown Clinical & Omics Development Engine (G-CODE) is a generic and flexible web-based platform that serves to allow basic, translational and clinical research activities by integrating patient characteristics and clinical outcome data with a variety of high-throughput research data in a unified environment to enable systems medicine. Through this modular, extensible and flexible infrastructure, we can quickly and easily assemble new translational web applications with both analytic and generic administrative features. New analytic functionalities specific to the needs of a particular disease community can easily be added within this modular architecture. With G-CODE, we hope to help enable the creation of new disease-centric portals, as well as the widespread use of biomedical informatics tools by basic, clinical and translational researchers, through providing powerful analytic tools and capabilities within easy-to-use interfaces that can be customized to the needs of each research community. 
This infrastructure was first deployed in the form of the Georgetown Database of Cancer (G-DOC) [1], which includes a broad collection of bioinformatics and systems biology tools for analysis and visualization of four major omics types: DNA, mRNA, microRNA and metabolites. Although several rich data repositories for high dimensional research data exist in the public domain, most focus on a single data type and do not support integration across multiple technologies. G-DOC contains data for more than 2,500 patients with breast cancer and almost 800 patients with gastrointestinal cancer, all of which are handled in a manner that allows maximum integration. We believe that G-DOC will help facilitate systems medicine by allowing easy identification of trends and patterns in integrated datasets and will hence facilitate the use of better targeted therapies for cancer. One obvious area for expansion of the G-CODE/G-DOC platform infrastructure is to support next-generation sequencing (NGS), which is a highly enabling and transformative emerging technology for the biomedical sciences. Nonetheless, effective utilization of these data is impeded by the substantial handling, manipulation and analysis requirements that are entailed. We have concluded that cloud computing is well positioned to fill these gaps, as this type of infrastructure permits rapid scaling with low input costs. As such, the Georgetown University team is exploring the use of the Amazon EC2 cloud and the Galaxy platform to process whole exome, whole genome, RNA-Seq and chromatin immunoprecipitation (ChIP)-Seq NGS data. The processed NGS data will be integrated into G-DOC to ensure that they can be analyzed in the full context of other omics data. Likewise, all G-CODE projects will simultaneously benefit from these advances in NGS data handling. 
Through technology re-use, the G-CODE infrastructure will accelerate progress in a variety of ongoing programs that are in need of integrative multi-omics analysis and will advance our opportunities to practice effective systems medicine in the near future.
genome. Compared with the genome sequence of a non-virulent strain, 129Pt, a disproportionate number of sRNAs (about 30%) were located in a genomic region unique to strain 2336 (accounting for about 18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations that could include virulence. Conclusions Overall, this study describes an RNA-Seq-based transcriptome map of H. somni, an important agricultural pathogen, that was constructed to identify functional genomic elements. Our genome-wide survey predicts numerous novel expressed regions that need to be characterized biologically to improve our understanding of disease pathogenesis. A description of all of the functional elements in the H. somni system is a prerequisite for using holistic systems approaches to understand the complex pathogenesis of bovine respiratory disease.

Background In this work, we study the benefits of using optical maps to improve genome assembly. Many modern assembly algorithms rely on a de Bruijn graph paradigm to reconstruct a genome from short reads. Ambiguities caused by repeats within the genome cause the final assembly to be broken up into many contigs, because the assembler does not have enough information to find the one correct traversal of the graph. Optical mapping technology can be useful for determining the correct path in the de Bruijn graph, through providing estimates on the locations of one or more restriction enzyme patterns in the genome, thereby constraining the possible traversals of the graph to only those that are consistent with the map. A particular traversal that does not align well with the optical map can be discarded as incorrect. Previous work has shown how to construct optical maps [1,2] for scaffolding contigs [3]. Methods Our algorithm relies on a depth-first search strategy. 
As the depth-first search proceeds and its corresponding sequence is extended, we check whether the resultant sequence would generate an optical map that matches the optical map of the genome. If the candidate in silico optical map matches the optical map of the genome, we proceed with the depth-first search. Otherwise, we backtrack in the depth-first search until we find a path that covers the entire graph and whose sequence has an optical map that matches the optical map of the entire genome. Although the total number of paths in the de Bruijn graph can be exponential in the number of nodes and edges in the graph [4], a reference optical map can effectively prune the search space of paths. To improve performance, we start by finding edges in the de Bruijn graph that can be uniquely placed on the optical map. These edges, which we call landmark edges, can also help guide our depth-first search. Although there may be multiple paths in the de Bruijn graph that can yield sequences with optical maps that match the genome's optical map, these paths all yield very similar sequences in most cases.
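The search described in the Methods can be sketched in a few lines of Python. This is a toy illustration rather than the authors' implementation, and every name in it is hypothetical. It makes several simplifying assumptions: the graph is an edge-labelled multigraph in which each edge label is the sequence appended when the edge is traversed (overlaps already removed), every edge is used exactly once, the optical map is reduced to a list of restriction-site start positions, and map error is modelled as a single position tolerance `tol`. Landmark edges are not implemented.

```python
def cut_positions(seq, site):
    """Start positions of every occurrence of the restriction site in seq."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def consistent(seq, site, ref_cuts, tol, complete=False):
    """Check whether the in silico map of seq agrees with the reference map."""
    found = cut_positions(seq, site)
    if len(found) > len(ref_cuts):
        return False
    if any(abs(f - r) > tol for f, r in zip(found, ref_cuts)):
        return False
    if complete:
        return len(found) == len(ref_cuts)
    # Prune if the next expected cut should already be visible but is absent.
    missing = ref_cuts[len(found):]
    return not missing or len(seq) - len(site) <= missing[0] + tol


def assemble(graph, start, site, ref_cuts, tol=0):
    """Depth-first search for a traversal using every edge once that matches
    the reference map. graph maps node -> list of (next_node, edge_label)."""
    total_edges = sum(len(out) for out in graph.values())
    used = set()

    def dfs(node, seq, n_used):
        if n_used == total_edges:
            ok = consistent(seq, site, ref_cuts, tol, complete=True)
            return seq if ok else None
        for idx, (nxt, label) in enumerate(graph.get(node, [])):
            key = (node, idx)
            if key in used:
                continue
            candidate = seq + label
            if not consistent(candidate, site, ref_cuts, tol):
                continue  # extension contradicts the optical map: prune
            used.add(key)
            result = dfs(nxt, candidate, n_used + 1)
            if result is not None:
                return result
            used.remove(key)  # backtrack
        return None

    return dfs(start, "", 0)
```

In a graph where two repeat-induced loops leave the same node, both loop orders are valid traversals; the position of a single restriction site in the reference map is enough to choose between them, which is the pruning effect the abstract describes.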
Results Given modest assumptions about the errors in the optical map, initial simulations show that our algorithm is very effective at assembling bacterial genomes, given read lengths of 100 or longer. The majority of our assemblies match the original sequences used in our simulations very closely. We will also present the results of simulations aimed at measuring the effect of errors on the correctness of the reconstruction and at measuring how the choice of restriction enzymes can improve the sequence assembly. Conclusions Our work shows that optical maps can be used effectively to aid in genome assembly. We are currently extending our approach to handle much larger graphs and to tolerate higher amounts of mapping error. In our final assembly, we would also like to be able to detect and mark regions that we are less certain about and regions that we are confident are correct.
Rare genetic variants of large effect may confer a substantial genetic risk for common diseases and complex traits. There is considerable interest in sequencing limited genomic regions such as candidate genes and target regions identified by genetic linkage and/or association studies. Next-generation sequencing of pooled DNA samples is an efficient way to identify rare variants in large sample sets. Although sample pooling can reduce the labor and cost of sequencing, it also reduces the sensitivity and specificity for effective and reliable identification of rare variants. It remains a challenge to solve these problems using the available computational genomics tools. We have developed an effective Illumina-based sequencing strategy using pooled samples and have optimized a novel base-calling algorithm, Srfim, and a variant-calling algorithm, SERVIC4E (Sensitive Rare Variant Identification by Cross-pool Cluster, Continuity & Tail-Curve Evaluation). SERVIC4E analyzes base composition by cycle or tail-curves across sample pools and employs multiple filtering strategies, including quality and continuity cluster analysis, average quality filtering, tail-curve filtering and error proximity filtering, to accurately identify rare sequence variants. We validated these algorithms using two independent Illumina sequence datasets generated from different pool sizes, read lengths and sequencing chemistries. Using these programs, we identified 32 coding variants, including 14 present only once over 24 exon-containing regions in one sample cohort (n = 480), and 41 coding variants, including 16 present only once in the same regions in an unrelated cohort (n = 480). Validation of these variants by Sanger sequencing revealed an excellent combination of sensitivity (97.8% and 96.4%) and specificity (84.9% and 93.8%) for variant detection in pooled samples from both cohorts, respectively. 
Data from these studies showed that our algorithms compare favorably with the available programs, including SAMtools, SNPSeeker, CRISP and Syzygy, for the effective and reliable detection of rare variants in pooled samples.

Background Acute myeloid leukaemia (AML) is a malignancy of the leukocytes, or white blood cells. It arises from clonal haematopoietic stem cells that proliferate uncontrollably and are at the same time defective in differentiation. Differentiation therapy is the most successful therapy available to date for the treatment of leukemia. Unlike chemotherapy and other therapeutic approaches, differentiation therapy induces the cancerous cells to differentiate and mature into terminally differentiated forms that perform normal functions. All-trans retinoic acid (ATRA), vitamin D3 and arsenic trioxide are the clinically proven therapies available for the treatment of AML. However, they are associated with several severe side effects. Therefore, the development of novel alternative therapeutic drugs is an active area of investigation. Recent advances in genomics and proteomics have tremendous implications for drug development and for understanding a drug's molecular mechanism within a reasonably short time. Materials and methods Momordica charantia seed extract was prepared using organic solvents and purified by TLC and NP-HPLC. The bioactive purified fraction was designated P3 (Peak fraction 3). HL60 cells were treated with P3, and differentiation was estimated by the NBT (Nitro Blue Tetrazolium) reduction assay. The lineage of differentiation was confirmed by Wright-Giemsa staining and by cell surface marker staining analyzed by flow cytometry. The signaling pathway was identified by kinase inhibitor treatment followed by P3-mediated induction of differentiation.

P46
Transcriptome profiling of the differentiating cells was performed by cRNA microarray, with non-treated HL60 cells as a control and Affymetrix chips for hybridization of cRNA.

Results
In the present study we have shown by Wright-Giemsa staining that P3 differentiates HL60 cells into granulocytes, which was further confirmed by upregulation of CD15 and downregulation of CD14 on flow cytometry. Unlike other available therapeutic agents, P3 is not toxic even at a concentration of 50 μg/ml and resulted in ~56% cell differentiation. Further, a detailed study of the changes in expression of various genes involved in differentiation was done by cRNA microarray. Microarray data suggested that 804 genes are upregulated specifically in P3-treated cells, most of which are related to metabolic pathways essential for differentiated cells and to apoptosis. Moreover, 467 genes are downregulated, and these are generally associated with cell cycle progression. Different kinase inhibitors were used to identify the signaling pathway involved in P3-mediated differentiation, and we found that P3 acts through the MAPK pathway, which was confirmed by PD98059-dependent inhibition of P3 activity. This result was supported by upregulation of MAPK-associated genes in the transcriptome analysis.
Conclusion The present study reports, for the first time, a detailed investigation of an alternative drug isolated from a plant and its effect at the signaling level as well as on the global gene expression of differentiating leukemic cells. This study has potential implications for targeted drug design for the treatment of leukemia as well as other forms of cancer.