Genomes or exomes: evaluation of cost, time and coverage

The field of human genetics is being reshaped by exome and genome sequencing. Several lessons are evident from observing the rapid development of this area over the past 2 years, and these may be instructive with respect to what we should expect from 'next-generation human genetics' in the next few years.

Cancer is driven by mutation. Using massively parallel sequencing technology, we can now sequence the entire genome of cancer samples, allowing the generation of comprehensive catalogs of somatic mutations of all classes. Bespoke algorithms have been developed to identify somatically acquired point mutations, copy number changes and genomic rearrangements, which require extensive validation by confirmatory testing. The findings from our first handful of genomes illustrate the potential for next-generation sequencing to provide unprecedented insight into mutational processes, cellular repair pathways and gene networks associated with cancer development. I will also review the possible applications of these technologies in a diagnostic and clinical setting and the potential routes for translation.
Massively parallel sequencing is transforming our knowledge of cancer, yet the medical value of next-generation approaches has not been fully established. From a technical perspective, it is easy to envisage that, within a few years, the primary diagnostic approach for all cancers will be to assess a partial or whole cancer genome sequence; however, the adoption of this approach will ultimately depend on the development of robust and valid models for the tailoring of therapy. Thus, within a short period, the focus of genomic investigation will shift from the current emphasis on discovery in poorly annotated datasets, such as The Cancer Genome Atlas, to ambitious investigations that focus on precise clinical questions. This transition will occur in two stages. The first stage will be a retrospective, 'genome-backward' approach, in which patients are treated blind to genomics but consent to prospective germline and tumor sequencing, as well as data sharing. In this way, models that use mutation patterns to predict treatment outcomes can be developed.
In a later prospective, 'genome-forward' phase, therapeutic postulates that arise from genome sequencing will be used as the basis for clinical trial eligibility or stratification. Specific examples of how these approaches are being studied in breast cancer will be discussed.
Recent studies have indicated that humans have an exceptionally high per-generation mutation rate of 7.6 × 10^-9 to 2.2 × 10^-8. These spontaneous germline mutations can have serious phenotypic consequences when affecting functionally relevant bases in the genome. In fact, their occurrence may explain why cognitive disorders with a severely reduced fecundity, such as mental retardation, remain frequent in the human population, especially when the mutational target is large and comprises many genes. This would explain a major paradox in the evolutionary genetic theory of these disorders. In this presentation, I will describe our recent work on using a family-based exome sequencing approach to test this de novo mutation hypothesis in ten patients with unexplained mental retardation [1]. Unique nonsynonymous de novo mutations were identified and validated in nine genes. Six of these, identified in different patients, were likely to be pathogenic based on gene function, evolutionary conservation and mutation impact. The clinical relevance of these novel genes, and the ultimate proof that they cause disease, lies in the identification of de novo mutations in additional patients with a similar phenotype. As such, we are currently screening approximately 1,200 patients with unexplained mental retardation for mutations in YY1, which is one of these newly identified genes. In addition, we are extending our family-based exome sequencing approach to 100 patients to establish the diagnostic yield for de novo mutations in patients with unexplained mental retardation. These findings, when replicated, provide strong experimental support for a de novo paradigm for mental retardation. Together with de
Diseases of the vaginal tract result from perturbations of the complex interactions among microbes of the host vaginal ecosystem.
Recent advances in our understanding of these complex interactions have been enabled by next-generation-sequencing-based approaches, which make it possible to study the vaginal microbiome. In harnessing these approaches, we are beginning to define what constitutes an imbalance of the vaginal microbiome and how such imbalances, along with associated host factors, lead to infection and disease states such as bacterial vaginosis (BV), preterm births, and susceptibility to HIV and other sexually acquired infections. We have exploited various approaches to this end: comparative analysis of reference microbial genomes of vaginal isolates; comparative microbiome, metabolome and metagenome analysis of vaginal communities from subjects deemed to be healthy and individuals with BV; and comparative microbiome analysis of vaginal communities from humans and non-human primate species. The results from comparative genome sequencing have led us to suggest that different strains of the proposed pathogen Gardnerella vaginalis have different virulence potentials and that the detection of G. vaginalis in the vaginal tract is not indicative of a disease state [1]. Comparative microbiome, metabolome and metagenome analysis of vaginal communities from humans has demonstrated that the microbial communities from subjects with BV have a defined bacterial composition and metabolic profile that is distinct from subjects who do not have BV [2 and unpublished observations]. Our studies of microbial communities from non-human primate species and humans provide a unique comparative context. From an evolutionary perspective, humans and non-human primates differ considerably in mating habits, estrus cycles and gestation period. Moreover, birth is difficult in humans relative to other primates, increasing the risks of maternal injury and infection.
In light of these numerous differences between humans and non-human primates, we hypothesize that humans have microbial populations that are distinct from those of non-human primates. Preliminary results show that the vaginal microbiomes of non-human primates are more diverse and are compositionally distinct from human vaginal microbiomes [3,4]. The composition of bacterial genera found in non-human primates is dissimilar to that seen in humans, most notably with lactobacilli being much less abundant in non-human primates. Our observations point to vaginal microbial communities being an important component of an evolutionary set of adaptations that separates humans from other primates and is of fundamental importance to health and reproductive function.
For more than a decade, the Joint Center for Structural Genomics (JCSG) [1] has been at the forefront of developing tools and methodologies that allow the application of high-throughput structural biology to a broad range of biological and biomedical investigations. In the previous phases of the National Institutes of Health's Protein Structure Initiative (PSI; 2000 to 2010) [2], we explored structural coverage of uncharted regions of the protein universe [3], as well as a single organism, allowing complete structural reconstruction of the metabolic network of Thermotoga maritima [4]. In the current phase (PSI:Biology; 2010 to 2015), the JCSG is leveraging its high-throughput platform to explore the structural basis for host-microbe interactions in the human microbiome. The emerging field of metagenomics has been particularly enlightening: the human gut microbiome sequencing projects have already uncovered fascinating new families and expansions of known families for adaptation to this environment. The gut microbiota is dominated by poorly characterized bacterial phyla, which contain an unusually high number of uncharacterized, largely unstudied proteins.
Their influence upon human development, physiology, immunity and nutrition is only starting to surface and is thus an exciting new frontier for structural genomics, where we can structurally investigate the contributions of these microorganisms to human health and disease. The JCSG is located
Next-generation sequencing of RNA (RNA-Seq) is a powerful tool that can be applied to a wide range of biological questions. RNA-Seq provides insight at multiple levels into the transcription of the genome. It yields sequence, splicing and expression-level information, allowing the identification of novel transcripts and sequence alterations. We have been developing and comparing methods for samples that present a challenge: that is, those with low quantity and/or quality RNA. RNA-Seq methods that start from total RNA and do not require the oligo(dT) purification of mRNA will be valuable for such challenging samples. Such methods use alternative approaches to reduce the fraction of sequencing reads derived from rRNA. We will present results from multiple approaches, including the use of not-so-random (NSR) primers for cDNA synthesis, low-C0t hybridization with a duplex-specific nuclease for light normalization and NuGEN's Ovation RNA-Seq kit. We demonstrated that these three methods successfully reduce the fraction of rRNA to less than 13%, even when starting from degraded RNA. We compared the performance between these methods and with 'gold standard' RNA-Seq data (derived from samples with large quantities of high-quality RNA), using quantitative criteria that evaluate effectiveness for genome annotation, transcript discovery and expression profiling. The application of these methods to samples that contain degraded RNA and/or very low input amounts of RNA will also be presented.

Viral diversity in children with diarrhea in Gambia
Irina Astrovskaya, Bo Liu and Mihai Pop

Results
We were able to detect and assemble sequences from known diarrhea-causing viruses (such as rotaviruses, adenoviruses and noroviruses), known human viruses (such as herpesviruses and enteroviruses) and potential diarrhea-causing viruses (such as bocaviruses, astroviruses and parechoviruses). These findings were consistent with independent virology results.
In some clinical cases, sequences from classic viruses were found, but the virology results were negative.
COSMIC provides a large number of graphical and tabular views for interpreting and mining the large quantity of information, as well as the facility to export the relevant data in various formats. The website can be navigated in many ways to examine mutation patterns on the basis of genes, samples and phenotypes, which are the main entry points to COSMIC. COSMIC also provides various options to browse the data in a genomic context. Integration with the Ensembl genome browser allows the visualization of full genome annotations, together with COSMIC data, on the GRCh37 genome coordinates. COSMIC also contains its own genome browser, which facilitates data analysis by combining genome-wide gene structures and sequences with rearrangement breakpoints, copy number variations and all somatic substitutions, deletions, insertions and complex gene mutations. The main COSMIC website [1] encompasses all of the available data. However, within COSMIC, the Cancer Cell Line Project [3] is a specialized component, which provides details of the genotyping of almost 800 commonly used cancer cell lines across the set of known cancer genes. Its focus is to identify driver mutations, or those likely to be implicated in the oncogenesis of each tumor. This information forms the basis for integrating COSMIC with the Genomics of Drug Sensitivity in Cancer project [4], which is a joint effort with the Massachusetts General Hospital [5] to screen this panel of cancer cell lines against potential anticancer therapeutic compounds to investigate correlations between somatic mutations and drug sensitivity.
Data on somatic mutations in cancer are being produced at a rapidly increasing rate, and the combined analysis of large distributed datasets is becoming ever more difficult. However, COSMIC curates and standardizes this information in a single database, providing user-friendly browsing tools and analytical functions, thus ensuring its role as a key resource in human cancer genetics.
Background Recent genome-wide association studies (GWAS) have identified allele T of a single nucleotide polymorphism (SNP), rs2294008, in the prostate stem cell antigen (PSCA) gene as a risk factor for bladder cancer [1,2]. In the present study, we aimed to find additional disease-associated SNPs in the PSCA region and to explore their possible molecular function.
Methods Based on information from the 1000 Genomes and HapMap 3 projects, we performed imputation analysis on 3,532 bladder cancer cases and 5,120 healthy controls of European ancestry from the stage 1 bladder cancer GWAS, within ±100 kb of the region flanking the GWAS signal, rs2294008. The average allele dosage and best-guess genotypes were estimated and tested for association between SNP variants and bladder cancer risk by using unconditional logistic regression. Functional follow-up studies included RNA sequencing in normal and tumor bladder samples and electrophoretic mobility shift assays to examine the potentially altered DNA-protein interactions for SNPs of interest.
Results A total of 639 imputed and 37 genotyped SNPs within ±100 kb of the region of the original GWAS signal were tested for genetic association with bladder cancer. In these stage 1 GWAS samples, the SNP rs2294008 had a per-allele odds ratio (OR) of 1.09 (95% confidence interval (CI) = 1.02 to 1.16, P = 6.93 × 10^-4). Multivariable logistic regression analysis adjusted for the study center, age, gender, smoking status and rs2294008 genotype revealed a novel associated variant, rs2978974 (OR = 1.11, 95% CI = 1.04 to 1.19, P = 1.62 × 10^-3).
There was low linkage disequilibrium between rs2978974 and the original GWAS signal, rs2294008 (D' = 0.19, r^2 = 0.02). Only individuals carrying the risk variant of both SNPs had an increased risk of bladder cancer (OR = 1.24, 95% CI = 1.13 to 1.35, P = 4.69 × 10^-6) and not individuals who carried a risk variant of only one of the SNPs (P > 0.05). Stratified analysis suggested that this compound effect of rs2294008 and rs2978974 was more significant in males (OR = 1.27, P = 2.80 × 10^-6) than in females (OR = 1.08, P = 0.52). rs2978974 resides 10 kb upstream of rs2294008, is marked by an H3K4me3 signal and is in the vicinity of an androgen-receptor-binding site. Using RNA sequencing of bladder samples, we showed that rs2978974 is located within an alternative, untranslated first exon of PSCA. Using the electrophoretic mobility shift assay with nuclear proteins from LNCaP and HeLa cells, we observed that the non-risk-associated allele (G) of rs2978974, but not the risk allele (A), could bind to ELK1, a protein belonging to the ETS family of transcription factors.
Conclusions We identified a SNP, rs2978974, in the PSCA region as a novel marker for bladder cancer susceptibility. There was a compound effect in carriers of both the rs2294008 and rs2978974 risk variants. The functional relevance of rs2978974 might be related to the loss of ELK1 regulation by the risk allele (A) and differential regulation of PSCA mRNA expression.
Background Dinoflagellates are a diverse group of ecologically important eukaryotic algae, the global impact of which ranges from the large-scale primary production of oxygen [1] to devastating toxic algal blooms [2]. These organisms have exceptionally large genomes (10^9 to 10^11 bases) [3] and highly duplicated genes (which can occur thousands of times within a single genome) [4]. These and other unusual characteristics have made dinoflagellates difficult to study using traditional molecular biology techniques.
Sequence data for dinoflagellates are correspondingly sparse, and not a single genome sequence has been published to date. As part of our project called Assembling the Dinoflagellate Tree of Life (DAToL), our laboratory has sequenced the transcriptome of Polarella glacialis. Its genome is estimated to be only 3 Gb in size, making it one of the smallest known dinoflagellate genomes. Because we had to rely on de novo assemblers that had been tested using data from organisms that are extremely divergent from dinoflagellates, we took special care in our attempts to validate the data. Before expanding our analyses to include additional dinoflagellates, we compared the results from different sequencing and assembly methods.
Methods Total RNA was extracted from cultured P. glacialis. This sample was then divided and shipped to Macrogen for rRNA degradation, library preparation and sequencing. One library was sequenced on one-eighth of a Roche/454 GS FLX picotiter plate using Titanium chemistry. A second library was sequenced using one lane on an Illumina GAIIx sequencer for 78 cycles in both directions (paired end). The sequences were assembled using Newbler, MIRA, Oases and Trinity, and they were analyzed using various custom scripts.

Results
The total amount of unassembled 454 sequence data added up to less than one-third of the combined lengths of only those Trinity transcripts that had a significant BLAST hit against a sequence in GenBank, indicating that we did not achieve complete coverage with our 454 data.
Conclusions Our primary hypothesis was that the longer read lengths of the 454 data might allow the corresponding assemblers to better resolve repetitive sequences, which could be instrumental for assembling conserved regions within highly duplicated genes. Our failure to obtain complete coverage with the 454 dataset undermined our ability to test this hypothesis, although we made several other interesting observations. Notably, despite the vast disparity in the depth of the coverage between the 454 and Illumina assemblies, we observed unique, apparently real sequences within some of the 454 contigs.
few days. However, the sequencing results always turn out to contain several hundred contigs. A multiplex PCR procedure is then needed to fill all of the gaps and to link the contigs into one full-length genome sequence [1-10]. The full-length prokaryotic genome sequence is the gold standard for comparative prokaryotic genome analysis. This study assessed pyrosequencing strategies by using a simulation with 100 prokaryotic genomes.

Results
Genome Biology 2011, 12(Suppl 1) http://genomebiology.com/supplements/12/S1
Our simulation shows the following: first, a single-end 454 Jr Titanium run combined with a paired-end 454 Jr Titanium run may assemble about 90% of 100 genomes into <10 scaffolds and 95% of 100 genomes into <150 contigs; second, the average contig N50 size is more than 331 kb (Table 1); third, the average single base accuracy is >99.99% (Table 1); fourth, the average false gene duplication rate is <0.7% (Table 1); fifth, the average false gene loss rate is <0.4% (Table 1); sixth, the total size of long repeats (both repeat length >300 bp and >700 bp) is significantly correlated with the number of contigs (Table 4); and, seventh, increasing the read length of a pyrosequencing run could improve the assembly quality significantly (Tables 1-3).
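The contig N50 reported above is a standard assembly statistic: the contig length L such that contigs of length L or longer together account for at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name and the example contig lengths are illustrative assumptions, not values or code from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (illustrative only, not data from the study).
example_lengths = [400_000, 300_000, 150_000, 100_000, 50_000]
print(n50(example_lengths))
```

In this made-up example the two longest contigs already cover more than half of the 1 Mb total, so the N50 is the length of the second contig.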
Conclusions A single-end 454 Jr run combined with a paired-end 454 Jr run is a good strategy for prokaryotic genome sequencing. This strategy offers a way to produce a high-quality draft genome sequence of almost any prokaryotic organism, selected at random, within days. It could be the first step to achieving the full-length genome sequence. It also makes the subsequent multiplex PCR procedure (for gap filling) much easier, aided by the knowledge of the order and orientation of most of the contigs. As a result, large-scale full-length prokaryotic genome-sequencing projects could be finished within weeks.
Background A recent genome-wide association study (GWAS) identified a single nucleotide polymorphism, rs8102137, located 6 kb upstream of the cyclin E1 gene (CCNE1) on chromosome 19q12, as a risk factor for bladder cancer (odds ratio (OR) = 1.13, P = 1.7 × 10^-11) [1]. CCNE1 encodes a cell cycle protein that regulates cyclin-dependent kinases and is therefore an important cancer susceptibility gene.
Methods This study used 42 bladder tumor samples and 41 normal bladder tissue samples (24 matched normal-tumor pairs), HeLa cells and several prostate and bladder cancer cell lines. Genotyping of rs8102137 in DNA and rs7257694 in both DNA and cDNA samples was performed using an allelic discrimination genotyping assay. TaqMan and SYBR Green assays were used to measure the expression of the different CCNE1 isoforms. The CCNE1 isoforms were cloned into a pFC14A (HaloTag) CMV Flexi Vector. Protein expression of CCNE1 isoforms in normal and tumor bladder tissues and transfected cells was analyzed by western blotting. Subcellular localization of recombinant CCNE1 splicing forms was analyzed by confocal microscopy.
Results CCNE1 mRNA was expressed at a higher level in bladder tumors (n = 42) than in adjacent normal bladder tissue samples (n = 41, 3.7-fold, P = 2.7 × 10^-12). However, no association was found between mRNA expression level and the genotype of rs8102137. We observed strong allelic expression imbalance for a synonymous coding variation located in the last exon (rs7257694, Ser390Ser), which is in high linkage disequilibrium with rs8102137 (normal bladder tissue samples, n = 41, D' = 1.0, r^2 = 0.815; HapMap CEU samples, n = 60, D' = 0.95, r^2 = 0.68). In normal and tumor tissue samples heterozygous for both single nucleotide polymorphisms, the risk variant of rs8102137 was associated with lower expression of allele T of rs7257694 (normal samples, P = 2.2 × 10^-4; tumor samples, P = 1.11 × 10^-10). Western blotting analysis of bladder tissue and prostate cell line lysates revealed that the allelic expression imbalance is likely to be related to two CCNE1 protein isoforms that showed a differential pattern of expression dependent on the rs8102137 and rs7257694 genotype. We have cloned the alternative splicing forms of CCNE1 and are currently evaluating their functional relevance.
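The linkage disequilibrium measures quoted above (D' and r^2) can both be derived from one haplotype frequency and the two allele frequencies. The sketch below uses the standard textbook definitions; the function name and the input frequencies are made-up assumptions chosen only to show how D' can equal 1 while r^2 stays below 1, and they are not the study's data.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    # Raw disequilibrium coefficient.
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    # Squared correlation between alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Hypothetical frequencies: allele B only ever occurs on haplotypes carrying A.
d_prime, r2 = ld_stats(p_ab=0.25, p_a=0.30, p_b=0.25)
print(round(d_prime, 3), round(r2, 3))
```

When one allele is found only on the background of the other, D' reaches its maximum even though the allele frequencies differ, which keeps r^2 below 1.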
Conclusions Our results suggest that bladder-cancer-associated genetic variants of the CCNE1 gene might contribute to altered cell cycle regulation, owing to differential mRNA splicing producing different protein isoforms of CCNE1.
Background Metagenomics has allowed the study of a wide range of microbial communities, from those within the sea [1,2] to those of the human body [3]. Increasingly, de novo assembly is the first step in the analysis of these metagenomic samples. As the targets have increased in complexity, computational tools have started to emerge [4,5] to address the challenges presented by the assembly of these datasets. Although the targets and analyses have become more complex, the means of presenting the results has remained the same: a multi-FASTA text file. This presentation hides the variation that is present in the sampled biological community. The ability to navigate and view the complexity of a genomic sample may help drive novel biological insights. Here, we present a graphical visualization tool that allows the visual inspection of genome assembly graphs and the characterization of the genomic variation that is present in these graphs (that is, the differences between two or more related haplotypes commonly found in metagenomes or higher eukaryotes).
Methods Our software, ScaffViz [6], is open source and was developed as a plug-in for the Cytoscape graph viewer package [7,8]. Our assembly view represents assembly metadata within node/edge attributes. For example, node height corresponds to coverage (the amount of oversampling of a sequence), and node width is proportional to the length of the sequence. We support assemblies from Celera Assembler [9], Newbler [10], Bambus 2 and MetAMOS. The creation and initialization of Cytoscape objects is abstracted to allow a developer to easily add new assembly result formats without knowledge of Cytoscape's API.
We developed a layout algorithm based on information from the assembler on node position, orientation and length. ScaffViz allows users to show (or hide) an arbitrary subset of nodes. The viewer can also output genome sequence that corresponds to any subset of the graph, including all alternative sequences present in all selected subpaths. We believe that this representation may prove to be instrumental in finding and characterizing structural variants such as alternative genes, alternative regulatory units or mobile genomic elements.

Results
We evaluated the performance of ScaffViz on seven datasets of varying size and complexity. We report that the run time is approximately linear with respect to the number of elements in the graph (nodes + edges). The memory scales linearly with respect to the number of nodes. Extrapolating from these factors, a graph of 250,000 contigs can be opened in approximately 2 minutes using approximately 2.5 GB of memory. ScaffViz is scalable to large graphs and can be run on a laptop.
Conclusions We have developed a novel open-source assembly graph viewer, ScaffViz, as a plug-in for Cytoscape. ScaffViz supports the output of several popular assembly programs and is scalable to large metagenomic assemblies on a laptop.
Most of the DNA viruses in the gastrointestinal tract are phages, which infect bacterial hosts. Despite phages being the most abundant organisms on Earth, as well as extremely active players in the global ecosystem, much remains unknown about how they function in their natural environments. Advances in whole genome sequencing technologies have generated a large collection of hundreds of phage genomes, allowing deep insight into the genetic evolution of phages, and metagenomics technologies seem to promise more rewarding glimpses into their life cycles and community structures.
Recently, we developed an automated approach to assemble a collection of orthologous gene clusters of double-stranded DNA phages (phage orthologous groups, or POGs). This approach follows the well-known clusters of orthologous groups (COGs) framework to identify sets of orthologs by examining top-ranked sequence similarities between proteins in complete genomes without the use of arbitrary similarity cutoffs, and it thus represents a natural system for examining fast-evolving and slow-evolving proteins alike. This automated approach was designed to keep pace with the rapid and accelerating growth of whole genome information from sequencing projects. In particular, we employ a faster graph-theoretical COG-building algorithm that vastly improves our ability to deal with larger numbers of genomes (N) by reducing the worst-case complexity from O(N^6) to O(N^3 × log N). This system encompasses more than 2,000 groups from the almost 600 known phage genomes deposited at the National Center for Biotechnology Information and is in the process of being expanded to include single-stranded DNA phages and single- and double-stranded RNA phages.
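The core idea of the COG framework mentioned above, as it is generally described, is to link proteins that are each other's top-ranked hit across genomes, find triangles of such links spanning three genomes, and merge triangles that share a side. The sketch below is a naive illustration of that idea only; it is not the faster graph-theoretical algorithm referred to in this abstract, and the function name, data layout and toy proteins are all hypothetical.

```python
from itertools import combinations


def build_clusters(best_hit):
    """Cluster proteins by triangles of symmetric best hits.

    best_hit maps (protein, other_genome) -> top-ranked protein in other_genome,
    where each protein is a (genome, name) tuple. No similarity cutoff is used.
    """
    # Keep only symmetric (reciprocal) best hits as undirected edges.
    edges = set()
    for (prot, _genome), hit in best_hit.items():
        if best_hit.get((hit, prot[0])) == prot:
            edges.add(frozenset((prot, hit)))

    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # A triangle is three mutually linked proteins from three different genomes.
    triangles = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbors[a] & neighbors[b]:
            if len({a[0], b[0], c[0]}) == 3:
                triangles.add(frozenset((a, b, c)))

    # Merge triangles that share a side, using union-find over triangle indices.
    tri_list = sorted(triangles, key=sorted)
    parent = list(range(len(tri_list)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}
    for i, tri in enumerate(tri_list):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i

    clusters = {}
    for i, tri in enumerate(tri_list):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


# Toy input (hypothetical proteins): two triangles sharing the side a1-b1,
# plus an isolated reciprocal pair a2-b2 that forms no triangle.
a1, b1, c1, d1 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")
a2, b2 = ("A", "a2"), ("B", "b2")
toy_hits = {
    (a1, "B"): b1, (a1, "C"): c1, (a1, "D"): d1,
    (b1, "A"): a1, (b1, "C"): c1, (b1, "D"): d1,
    (c1, "A"): a1, (c1, "B"): b1,
    (d1, "A"): a1, (d1, "B"): b1,
    (a2, "B"): b2, (b2, "A"): a2,
}
print(build_clusters(toy_hits))
```

In the toy input the two triangles merge into a single four-protein group, while the isolated reciprocal pair is left unclustered because it never closes a triangle.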
Using this approach, we found that more than half of the POGs have no or very few evolutionary connections to their cellular hosts, indicating that these phages combine the ability to share and transduce the host genes with the ability to maintain a large fraction of unique, phage-specific, genes. Such genes are useful for targeted research strategies: for example, as diagnostic indicators and fundamental units of systems biology studies. We employed this set of phage-specific genes to probe the composition of several oceanic metagenomic samples. Although virus-enriched samples indeed contain more homologous matches to phage-specific POGs than a full metagenomic sample also containing cellular DNA, the total gene repertoire of the marine DNA virome is dramatically different from that of known phages. In particular, it is dominated by rare genes, many of which might be contained within virus-like entities such as cellular gene transfer agents rather than true viruses. This result might suggest the necessity of radically rethinking what constitutes the 'virus world', because the major component of (marine) viromes could be gene transfer agents that encapsidate bacterial and archaeal genes.
Background Recent genome-wide association studies have led to the reliable identification of single nucleotide polymorphisms (SNPs) at a number of loci associated with an increased risk of developing specific common human diseases. Each such locus implicates multiple possible candidate SNPs as being involved in the disease mechanism, and determining which SNPs actually contribute, and by what mechanism, is a major challenge. A variety of mechanisms may link the presence of a SNP to altered in vivo gene product function and hence contribute to disease risk.
We have analyzed the role of one of these mechanisms, nonsynonymous SNPs (nsSNPs) in proteins, for associations found in the Wellcome Trust Case-Control Consortium (WTCCC) study of seven common diseases [1] and the follow-up work.
Methods Using HapMap data and linkage disequilibrium information, we identified all possible candidate SNPs associated with increased disease risk. We then applied two computational methods [2,3], based on analysis of protein structure and sequence, to determine which of these SNPs have a significant impact on in vivo protein function (SNPs3D) [4].
Results Several of these disease-associated loci were found to be linked to one or more high-impact nsSNPs. In some cases, these SNPs are in well-known proteins (such as human leukocyte antigens). In other cases, they are in less well-established disease-associated genes (for example, MST1 for Crohn's disease), and in yet others, they are in proteins that have been poorly investigated (for example, gasdermin B, also for Crohn's disease). Approximately 55% of these disease-associated loci have at least one nsSNP, and about 33% of them have at least one high-impact nsSNP in those regions.
Conclusions Together, these data suggest a significant role for nsSNPs in
Background A major goal of metagenomics is to characterize the taxonomic composition of an environment. The most popular approach relies on 16S rRNA sequencing; however, this approach can generate biased estimates owing to differences in the copy number of the gene, even between closely related organisms, and owing to PCR artifacts. Alternatively, the taxonomic composition can be determined from metagenomic shotgun sequences by matching reads against a database of reference sequences. One major limitation of the computational methods that have been used for this purpose is the use of a universal classification threshold for all genes at all taxonomic ranks.
Methods We present a novel taxonomic profiler for metagenomic sequences, MetaPhyler [1], which relies on 31 phylogenetic marker genes as a taxonomic reference. Because genes can evolve at different rates and because shotgun reads contain gene fragments of different lengths, we propose that better classification results can be obtained by tuning the taxonomic classifier to the length of the gene fragment, to a particular gene and to the taxonomic rank. Our classifier uses different thresholds for each of these parameters, and these thresholds are automatically learned from the taxonomic structure of the reference database.
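The idea of using a separate classification threshold per gene, fragment length and taxonomic rank, rather than one universal cutoff, can be sketched as a simple lookup. Everything below is a hypothetical illustration: the gene names, length bins, score values and function names are invented, the thresholds are hard-coded rather than learned from a reference database, and none of it reflects MetaPhyler's actual implementation.

```python
# Ranks ordered from least to most specific (shortened list for illustration).
RANKS = ["phylum", "family", "genus"]

# Hypothetical similarity-score thresholds keyed by
# (marker gene, fragment-length bin, taxonomic rank). In the approach described
# above these would be learned from the reference database; here they are made up.
THRESHOLDS = {
    ("marker_A", "short", "phylum"): 25.0,
    ("marker_A", "short", "family"): 35.0,
    ("marker_A", "short", "genus"): 45.0,
    ("marker_A", "long", "phylum"): 60.0,
    ("marker_A", "long", "family"): 90.0,
    ("marker_A", "long", "genus"): 120.0,
}


def length_bin(read_length_bp):
    """Assign a read to a coarse length bin (the 150 bp boundary is an assumption)."""
    return "short" if read_length_bp < 150 else "long"


def classify(gene, read_length_bp, score, lineage):
    """Return the part of the best hit's lineage that the score supports.

    The read is assigned rank by rank, stopping at the first rank whose
    gene-, length- and rank-specific threshold the score fails to reach.
    """
    assigned = {}
    for rank in RANKS:
        threshold = THRESHOLDS.get((gene, length_bin(read_length_bp), rank))
        if threshold is None or score < threshold:
            break
        assigned[rank] = lineage[rank]
    return assigned


# Lineage of the (hypothetical) best reference hit for a read.
hit_lineage = {
    "phylum": "Proteobacteria",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
}
print(classify("marker_A", 60, 38.0, hit_lineage))
print(classify("marker_A", 300, 38.0, hit_lineage))
```

The same score of 38.0 supports a family-level call for a 60 bp fragment but no call at all for a 300 bp fragment, which is the behavior a single universal threshold cannot express.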

Results
We have randomly simulated about 300,000 DNA sequences of 60 bp and about 70,000 DNA sequences of 300 bp from phylogenetic marker genes. Table 1 shows the performance of the phylogenetic classifications from MetaPhyler, PhymmBL [2], MEGAN [3] and WebCARMA [4]. The query sequence itself was removed from the reference dataset when running the programs. The sensitivity of MetaPhyler is significantly higher than that of the other tools in all situations because our classifier is explicitly trained at each taxonomic rank.
In addition, we have created a simulated metagenomic sample comprising fi ve genomes. Table 2 shows the taxonomic profi les estimated by diff erent approaches. In this setting, MetaPhyler also outperforms the other approaches by more accurately reconstructing the true taxonomic distribution.
Conclusions We have introduced a novel taxonomic classification method for analyzing the microbial diversity from whole metagenome shotgun sequences. Compared with previous approaches, MetaPhyler is more
Results We identified a family with a previously undescribed lethal X-linked disorder of infancy comprising a distinct combination of an aged appearance, craniofacial anomalies, hypotonia, global developmental delays, cryptorchidism, cardiac arrhythmia and cardiomyopathy. We used X-chromosome exon sequencing and a recently developed probabilistic disease-gene discovery algorithm to identify a missense variant in NAA10, which encodes the catalytic subunit of the major human amino-terminal acetyltransferase (NAT; also known as hNaa10p). More recently, we became aware that a parallel effort on a second unrelated family converged on the same variant. The absence of this variant in controls, the amino acid conservation of this region of the protein, the predicted disruptive change and the co-occurrence in two unrelated families with the same rare disorder suggest that this is the pathogenic mutation. We confirmed this by demonstrating that the mutant hNaa10p had significantly impaired biochemical activity, and we therefore conclude that a reduction in acetylation by hNaa10p causes this disease.
Conclusions This is one of the first uses of next-generation sequencing to identify the genetic basis of a previously unrecognized X-linked syndrome. It is also the first evidence of a human genetic disorder resulting from direct impairment of amino-terminal acetylation, one of the most common protein modifications in humans. We have also demonstrated that a probabilistic disease-gene discovery algorithm (VAAST) can readily identify and characterize the genetic basis of this syndrome.

P14
Abstract not submitted for online publication.
Background Genome-wide association studies (GWAS) have identified a single nucleotide polymorphism, rs2294008 C/T, within the prostate stem cell antigen (PSCA) gene as a risk variant for bladder cancer [1]. PSCA is a glycosyl phosphatidylinositol (GPI)-anchored cell surface protein from the Ly-6/Thy-1 family of cell surface antigens. PSCA overexpression has been reported in bladder, prostate and pancreatic tumors. The risk allele (T) of rs2294008 creates a novel translation start site and extends the PSCA leader peptide sequence by 11 amino acids.
Methods The mRNA expression in 42 bladder tumor samples and 39 adjacent normal bladder tissue samples (24 matched normal-tumor pairs) was explored using genome-wide RNA sequencing and targeted PSCA mRNA expression assays. For allelic expression imbalance studies, genotyping of rs2294008 both in DNA and cDNA samples was performed using an allelic discrimination genotyping assay. Alternative allele-specific splicing forms of PSCA were cloned and transfected into several human cancer cell lines. The endogenous expression of PSCA protein and the expression pattern of the recombinant PSCA allelic isoforms in different cancer cell lines were studied by western blotting, confocal microscopy and fluorescence-activated cell-sorting analysis. PSCA protein expression in normal and tumor bladder tissue samples was examined in relation to rs2294008 genotypes by using immunohistochemistry.
Results PSCA mRNA was expressed at a 5.7-fold higher level in tumors than in matching normal bladder tissue samples (P = 0.0060). There was a strong allelic expression imbalance in tumor samples (P = 0.0020), based on 20 normal and 13 tumor samples that were heterozygous for rs2294008. PSCA mRNA expression was associated with the genotype of rs2294008 both in normal and tumor bladder tissue samples. Our preliminary data on the expression of recombinant allele-specific PSCA protein isoforms in transfected cells show a possible difference in the distribution of the cytoplasmic and membrane expression of these isoforms.

Conclusions
Our results suggest that the extension of the PSCA leader peptide by 11 amino acids, introduced by the risk allele (T) of rs2294008, may affect subcellular protein localization and the availability of functional GPI-anchored PSCA on the cell surface. These results may have clinical implications because antibodies that target cell-surface-expressed PSCA are in clinical trials for pancreatic and prostate cancer.
Surprisingly, all of the mutants, including 8-Δ, were viable and could withstand redox stresses; however, they were unable to activate or repress transcriptional events in response to hydrogen peroxide treatment, which was most evident in the 8-Δ mutant. In our work, network analysis was used to gain a better understanding of the biological networks whose gene expression is affected by these mutations.
Methods Microarray data (provided by [1]) were processed for input into the Cytoscape plug-in jActiveModules. Active sub-networks for select mutants were identified using all yeast interactions found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [2] as the background network (including protein-protein, metabolic and gene expression interactions). Nodes in each sub-network were input into the Database for Annotation, Visualization and Integrated Discovery (DAVID) [3] to identify which KEGG pathways were present.
Results Two hundred and six genes appeared in one or more of the active sub-networks. Only seven genes were present in the sub-networks of all strains. These were a known oxidative stress-induced aldose reductase (GRE3), four putative aryl-alcohol dehydrogenases (AAD3, AAD6, AAD10 and AAD14), a mitochondrial aldehyde dehydrogenase (ALD4) and a xylulokinase (XKS1). All of the genes were upregulated on average by 6- to 12-fold in all strains, except for 8-Δ with a 1.5-fold average upregulation and 5Prx-Δ with a 3-fold average upregulation.
Many metabolic pathways were affected by the knockouts; the pathway types affected depended on which peroxidase gene was knocked out. This result suggests that different thiol peroxidases may have a significant and specific impact on the regulation of metabolic pathways during oxidative stress. Surprisingly, the Gpx3-Δ active sub-network was similar to the Gpx1-Δ and Gpx2-Δ sub-networks. Gpx3 is known to sense hydrogen peroxide and pass that signal along to transcription factors; thus, it was expected that this sub-network would differ from that of the other Gpx mutants. Additionally, our results showed that amino acid metabolism, biosynthesis and degradation pathways were active in wild-type cells but were present in few mutant strains.

Conclusions
The results of this work indicate that thiol peroxidases, along with playing a key role in maintaining redox homeostasis, may also play a significant role in the regulation of metabolic pathways in yeast, thus illuminating the global role that thiol peroxidases play in oxidative stress.
Here, we present major improvements to the Metastats software and the underlying statistical methods. First, we describe new approaches for data normalization that allow a more accurate assessment of differential abundance by reducing the covariance between individual features implicitly introduced by the traditionally used ratio-based normalization. These normalization techniques are also of interest for time-series analyses or in the estimation of microbial networks. A second extension of Metastats is a mixed-model zero-inflated Gaussian distribution that allows Metastats to account for a common characteristic of metagenomic data: the presence of many features with zero counts owing to undersampling of the community. The number of 'missing features' (zero counts) correlates with the amount of sequencing performed, thereby biasing abundance measurements and the differential abundance statistics derived from them. Using simulated and real data, we show that these methods significantly improve the accuracy of Metastats. We also describe the addition of several new statistical tests to our code (including presence/absence and the corresponding odds ratio, and penetrance calculations) that improve the usability of our software in clinical practice.
Background A recent genome-wide association study (GWAS) of bladder cancer identified a single nucleotide polymorphism (SNP), rs11892031, within the UGT1A gene cluster on chromosome 2q37.1, as a novel risk factor. The UGT1A locus encodes nine UGT proteins, which belong to the phase II cellular detoxification system. UGTs are functionally important for the detoxification of aromatic amines, which are found in industrial chemicals and tobacco smoke and are known risk factors for bladder cancer. The UGT-encoding genes have exons 2 to 5 in common but have different first exons, which define the enzymatic activity and substrate specificity of the gene products.

Methods and results
We sequenced all nine highly similar alternative first exons for the UGT-encoding genes of up to 2,000 individuals. We identified 26 known nonsynonymous and 17 known synonymous coding variants but no novel variants. Imputation based on the GWAS dataset, a combined reference panel of HapMap 3 and the 1000 Genomes Project, and a subset of GWAS samples genotyped for all of the identified coding variants generated data for 1,170 SNPs within the whole UGT1A region. Of these markers, the strongest association was detected for an uncommon protective genetic variant that explained the original GWAS signal (odds ratio (OR) = 0.55, 95% confidence interval (CI) = 0.44 to 0.69, P = 3.3 × 10⁻⁷ in 4,035 cases and 5,284 controls; D' = 0.96, r² = 0.23 with rs11892031). No residual association in this region was detected after adjustment for this SNP. A typical genetic variant identified by GWAS for a common disease is expected to be a common allele (>10% minor allele frequency) that increases the disease risk. We show that the novel associated variant is an uncommon protective allele (1.14% in cases and 2.5% in controls). Interestingly, the risk allele (G) is conserved in 33 species, whereas the protective allele (T) is a human-specific variant. Even though this SNP is a synonymous coding variant, we show its association with quantitative mRNA expression of a specific functional splicing form of UGT1A6, probably through an exonic splicing enhancer.
Conclusions This study exemplifies that uncommon protective genetic variants are unusual suspects that may play important but underestimated functional roles in complex traits.
Background Horizontal gene transfers (HGTs) are pervasive in prokaryotes [1], being the routes of net-like evolution that collectively dominate the evolution of prokaryotes [2]. However, in eukaryotes, the effect of HGT has not been thoroughly analyzed, with the exception of the massive HGT from the endosymbionts [3].
Here, we report a comprehensive analysis of likely HGT events in different groups of unikonts (Amoebozoa, Archamoebae, Mycetozoa, the Fungi/Metazoa group, Choanoflagellida, Fungi and Metazoa).
Methods We analyzed the complete proteomes of 36 species of unikonts: 1 from the Archamoebae, 1 from Mycetozoa, 18 from Fungi, 13 from Metazoa and 1 from Choanoflagellida. These proteomes were manually selected to widely represent the unikont supergroup. Initial pre-candidate genes were obtained by analyzing each proteome using the DarkHorse program [4]. The program BLASTClust was then used to make clusters of putative unique transfer events at the origin of the different groups of unikonts. These clusters were separated into two groups: group I candidate clusters (clusters with no eukaryotic representative other than the unikont group analyzed), and group II candidate clusters (clusters with representatives from prokaryotes, the unikont group analyzed and other eukaryotes). Sequences from group I candidate clusters were analyzed using BLAST versus nr and RefSeq databases, compared with the clusters of orthologous groups for eukaryotic complete genomes (KOGs) [5] and manually curated to remove false positives that result from bacterial contamination of the genomic DNA. Group II candidate clusters were analyzed using a series of automatic, conservative filters to assess the quality of the candidates. Finally, all clusters were phylogenetically analyzed to define the final candidates and to infer putative donors.
Results Using this methodology, we detected numerous probable HGT events from prokaryotes (mainly Bacteria) to unikonts. These events are not distributed uniformly throughout the evolution of unikonts: for example, almost all HGTs detected in Amoebozoa occurred after the divergence of Archamoebae and Mycetozoa.
Importantly, we also detected many HGT events from Bacteria to Fungi, Choanoflagellida and Metazoa.
Conclusions Although HGTs are not as pervasive in eukaryotes as in prokaryotes, the amount of HGT detected in this study suggests that the acquisition of genes from Bacteria played a major role in the evolution of the unikonts.
Background Most studies exploring cancer progression have focused on the influence of individual genes, and few efforts have investigated the effects of interactions between genes within the genome. Our hypothesis is that cancer cells thrive by exploiting combinations of genes, in fact by exploiting networks of genes that both protect the cell against destruction and enhance its survival. We believe that these networks involve genes that tend to be coordinated in their copy number alterations, even when they are located at a distance in the genome. Radiation hybrid (RH) cells have a random assortment of genes as triploid rather than diploid. Our recent work studying genetic networks in libraries of RH cells has elucidated key survival-enhancing interactions with high specificity [1]. Because of the hardiness of the RH clones, statistically significant patterns of co-inherited, unlinked triploid gene pairs pointed to the cell survival mechanism. We identified more than 7.2 million significant interactions at single-gene resolution using the RH data.
Methods Our work with the RH data provided the rationale for an investigation of cancer survival networks, in particular for glioblastoma multiforme, a formidable brain cancer for which extensive datasets are available but few treatment options. We investigated correlated patterns of copy number alterations for distant genes in glioblastoma multiforme tumors using the same method we employed to construct the RH survival network. Public data were analyzed from 301 glioblastomas that had been assessed for copy number alterations using array comparative genomic hybridization [2].
Results The glioblastoma and RH survival networks overlapped significantly (P = 3.7 × 10⁻³¹). We therefore exploited the high-resolution mapping of the RH data to obtain single-gene specificity in the glioblastoma network. The combined network features 5,439 genes and 13,846 interactions (false discovery rate (FDR) <5%) and suggests novel approaches to therapy for glioblastoma. For example, although the epidermal growth-factor receptor (EGFR) oncogene is frequently activated in glioblastoma, EGFR inhibitors have limited therapeutic efficacy [3]. In the combined glioblastoma survival network, there are 46 genes that interact with EGFR, of which ten (22%) happen to be targets of existing drugs. This observation suggests that a flanking attack strategy that strikes at both EGFR and its partner genes in the glioblastoma survival network may be an effective approach to treating these tumors.
Conclusions By elucidating a genetic survival network for glioblastoma, we gained insight into the mechanisms of proliferation of this cancer and opened up new avenues for therapeutic intervention.
Background Hundreds of diverse genetic loci have been linked to autism spectrum disorders (ASDs), making large-scale analysis essential for understanding the molecular events underlying the pathogenesis of these disorders. Our laboratory first released the autism database AutDB in 2007 as a bioinformatics tool for systematic curation of all known ASD candidate genes [1][2][3]. AutDB was designed with a systems biology approach, integrating genetic entries within the Human Gene module with corresponding behavioral, anatomical and physiological data in the Animal Model module. In June 2011, we released a new Protein Interaction (PIN) module of AutDB, which serves as a comprehensive, up-to-date resource on the direct protein interactions of ASD-linked genes.
Methods To curate the PIN module, our researchers utilize a multi-level annotation model to systematically search, collect and extract information entirely from published, peer-reviewed scientific literature. Although we initially consult public molecular interaction databases (HPRD and BioGRID) and commercial molecular interaction software (Pathway Studio, version 7.1), every interaction is manually extracted and verified by evaluating the primary reference articles from PubMed. Our manual curation has proved critical for accurate annotation, because these references were the second largest source of references for the initial PIN dataset, providing more interactions than both HPRD and Pathway Studio. Each ASD gene entry within the PIN module is presented as a multi-level display, with interactive graphical and tabular views of its corresponding interactome.

Results
The initial PIN dataset includes interactomes for 86 ASD candidate genes, with a total of 1,311 direct protein interactions garnered from 533 unique primary references. These interactomes are composed of 6 interaction types and 13 species, documented by 402 distinct pieces of evidence. Our researchers will expand and maintain the data content of the PIN module with systematic updates.
Conclusions We have created an integrated bioinformatics tool that can be used for the large-scale analysis of the biological relationships among ASD candidate genes. Such network analysis is envisioned to provide a framework for identifying the key molecular pathways underlying ASD pathogenesis, potentially leading to the development of novel drug therapies.
Background Bladder cancer is the 9th most common cancer worldwide and the 13th most common cancer-related cause of death. Bladder cancer frequently recurs after the removal of primary carcinomas. This recurrence leads to repeated surgeries and long-term treatment and surveillance, making it the most expensive type of cancer to treat. Genetic factors and environmental factors such as cigarette smoking and occupational exposure to aromatic amines are linked to bladder cancer risk. Genome-wide association studies (GWAS) for bladder cancer have identified multiple genetic variants within genes and regions, including TP63, TERT-CLPTM1L and 8q24.21, to be highly associated with disease risk. Whole transcriptome sequencing (RNA-Seq) is a revolutionary tool for generating a large amount of qualitative and quantitative information, thus helping to explore known and novel transcripts, splicing forms and fusion genes.
Methods To understand the genetic and genomic landscape of the GWAS susceptibility regions, we investigated and characterized the entire transcriptome of normal and tumor bladder tissue samples by using massively parallel RNA sequencing.
We used an Illumina HiSeq 2000 instrument to sequence six paired samples of normal and tumor bladder tissues. For each of the samples, we generated 50 Gb of 100-bp reads to represent the whole transcriptome.
Results Using the Bowtie/TopHat and Samtools packages, we successfully aligned approximately 80% of the total sequence reads against the human genome reference sequence (build 19). Our analysis sought to identify alternative splicing forms, novel exons, non-coding transcripts and chimeric fusion events. Total levels of mRNA in normal and tumor samples were evaluated by Cufflinks analysis based on the Ensembl transcripts database. Multiple splicing isoforms were identified for some of the GWAS susceptibility genes, and some of these isoforms were differentially expressed between the tumor and normal samples. We found that novel transcripts and non-coding RNAs corresponding to gene desert regions such as 8q24 were abundantly expressed. Our next step will focus on validation of these differentially expressed genes and novel transcripts by using quantitative RT-PCR on independent samples.
Conclusions Using RNA-Seq, we explored transcripts corresponding to candidate regions identified by bladder cancer GWAS. Some of these transcripts demonstrated splicing variability and differential levels of expression between normal and tumor tissue samples, which might be of importance for bladder cancer.
Background Recent genome-wide association studies (GWAS) have identified multiple genetic variants associated with the risk of developing prostate cancer (PrCa). At least ten PrCa-associated single nucleotide polymorphisms (SNPs) are located within a gene-poor region on chromosome 8q24, but the functional mechanisms of each of these variants remain unknown. Normal prostate development, as well as tumor initiation and progression, greatly depends on the androgen receptor (AR) and its ligands, testosterone and 5α-dihydrotestosterone.
We hypothesized that genetic variants associated with PrCa risk might be important owing to their effects on AR-binding sites.

Methods and results
We comprehensively explored 11 PrCa GWAS published as of July 2011 in the National Human Genome Research Institute's GWAS database [1] and in PubMed [2]. We selected ten SNPs from the 8q24 region that were significantly and consistently associated with PrCa in Caucasian datasets (P < 5 × 10⁻⁷). By querying the CEU 1000 Genomes Project panel, we generated a list of 224 SNPs in high linkage disequilibrium (r² > 0.8) with the ten selected GWAS SNPs. Of all of the SNPs on this list, six variants were located in the regions identified as AR-binding sites, based on AR chromatin immunoprecipitation (ChIP)-Seq data from the University of California, Santa Cruz's genome browser [3]. To test for differential binding of AR to alleles of the six SNPs, we developed a protocol for quantitative multiplex allele-specific ChIP (AS-ChIP) assays. Confirmatory AS-ChIP with AR-specific antibodies in the LNCaP cell line showed that five of these SNPs were heterozygous in this cell line, and four of them showed statistically significant allele-specific differences in AR binding (P-value range = 0.0005 to 0.04, based on four biological replicates of AS-ChIP).
Background Metagenomics has opened the door to unprecedented comparative and ecological studies of microbial communities, ranging from the sea [1] to the soil (the terragenome) to within the human body [2,3]. Most analyses begin with assembly, as the short reads that are characteristic of most datasets severely limit the ability to classify the data taxonomically [4][5][6][7] and require considerable computational resources to perform comparative analyses (such as BLAST against public databases). In addition, given that many sequences are likely to be from novel organisms, classification methods relying on databases fail to acknowledge most of the novel species present in the dataset.
In an attempt to move away from reference-based analysis, computational tools based on promising algorithmic and statistical methods for metagenomic de novo assembly have recently started to emerge [8,9]. However, to date, they either are ill-suited to large datasets or have yet to offer significant improvements over existing genome assemblers that were not designed for metagenomic assembly.
Methods Here, we describe MetAMOS [10], an open-source, modular assembly pipeline built upon AMOS and tailored specifically for metagenomic next-generation sequencing data. MetAMOS is the first step toward a fully automated assembly and analysis pipeline, from mated reads (Illumina and 454) to scaffolds and ORFs. Currently, MetAMOS has support for four assemblers (SOAPdenovo [11], Newbler, CABOG and Minimus [12]), three annotation methods (BLAST, PhymmBL and MetaPhyler), two metagenomic gene prediction tools (MetaGeneMark and Glimmer-MG) and one unitig scaffolder engineered specifically for metagenomic data (Bambus 2). We also provide a novel graph-based algorithm to propagate annotations rapidly to all contigs in an assembly using, for example, only the largest contigs or contigs with high-confidence classification. MetAMOS has three principal outputs: subdirectories containing FASTA sequence of the contigs/scaffolds/variant motifs belonging to a specified taxonomic level, a collection of all unclassified/potentially novel contigs contained in the assembly, and an HTML report with detailed assembly statistics and summary charts.

Results and conclusions
We compared MetAMOS with other metagenomic assembly tools (Meta-IDBA and Genovo) and with genome assemblers that have previously been used with metagenomic data (CA-met and SOAPdenovo). We used both a mock/artificial dataset generated for the Human Microbiome Project (HMP) and real metagenomic samples from the HMP and its European counterpart (MetaHIT). On the mock dataset, MetAMOS compares favorably to existing metagenomic and genomic assemblers with respect to several validation metrics that take into account contig accuracy in addition to size. On the real dataset, MetAMOS also outperforms the existing software. These improvements can largely be attributed to heavy reliance on Bambus 2 and to assembly verification techniques that help identify and remove potentially chimeric contigs while running the pipeline.
In terms of biology, we were able to report several novel variant motifs that would be challenging at best to identify and extract from the output of other methods. In addition, much emphasis was placed on making MetAMOS compatible with a variety of next-generation sequencing technologies, genome assemblers and annotation methods, making the pipeline highly customizable for the beginner and advanced bioinformatics user alike.
Gaucher disease is the most common lysosomal storage disorder. It results from an inherited deficiency of the enzyme glucocerebrosidase (GBA); accumulation of the substrate of this enzyme has many clinical manifestations.

The mutation spectrum in Indian patients with Gaucher disease
Since the discovery of the GBA gene, more than 200 mutations have been identified, but only a handful of mutations are recurrent (L444P, N370S, IVS2, D409H and 55Del). To determine the spectrum of mutations in the Indian population, we performed mutational screening in children with Gaucher disease. Twenty-four patients from twenty families were enrolled in this study, after written informed consent was obtained. The diagnosis of Gaucher disease was based on mandatory clinical and biochemical analysis. An initial screening for five common mutations was carried out using PCR-RFLP. Patients who were negative for common mutations were screened by sequencing exons 9 to 11 (a mutation hotspot region) [1]. We identified common mutations (L444P, N370S, IVS2 and D409H [2], and 55Del [3]) in approximately 50% of the patients. In our patients, L444P (c.1448T>C) was the most frequently identified mutation, followed by D409H. Western data show that N370S is the most common mutation in Romanian patients [4]. One polymorphism (E340K) was identified in two patients who were compound heterozygotes for A456P/R463C and S237F/A269P, respectively. Our data highlight the spectrum of mutations that lead to Gaucher disease in the Indian population.
Background Given differential gene expression data across divergent mutant strain arrays of two enzyme subgroups, it would be logical to segregate by protein group ablation (PGA). Discrete correlate summation (DCΣ) was utilized to examine the differential effects of a hydrogen peroxide stressor on discrete and total yeast knockouts of the genes encoding glutathione peroxidase (Gpx) and peroxiredoxin (Prx), both groups starting from the wild-type (WT) strain [1]. While the half-life of the total Gpx knockout mutant is intermediate between that of the WT and the transient total Prx knockout mutant, the distribution of passage number of the various mutant strains can be separated into two groups independent of Gpx and Prx state. Based on half-viability, totalPrx <<<< nPrx << Gpx3 = Tsa1 < totalGpx < mPrx <<< Gpx1 < Gpx2 << Ahp1 = WT <<< Tsa2 (P < 0.0005, two-tailed t-test, n = 5, 6). DCΣ was also employed for the boundary between robust and gracile cultures. The aim of this study was to find the characteristic response of the transcriptome, from the perspective of PGA versus strain viability (SV).
Methods DCΣ is a method used to score variables that can be classified into two groups [2]. It is a composite score of a gene's mean group change and overall interaction difference relative to all others tested. Transcripts were included in this analysis only if the values for all conditions passed microarray quality control and were present in the Kyoto Encyclopedia of Genes and Genomes (KEGG) network [3]. Randomly sorted edges were sampled for comparison (P < 0.001, two-tailed t-test, n = 8,372). Edges that were sorted on average DCΣ score and grouped by biological process yielded a distinctive topology (P < 1e-85, two-tailed t-test, n = 8,372). The identified transcripts were subjected to functional annotation in the Database for Annotation, Visualization and Integrated Discovery (DAVID) [4].
Results Application of DCΣ to the individual and complete knockouts of Gpx (3 genes) and Prx (5 genes) identified 92 transcripts based on PGA and 43 based on SV, with a 13 gene overlap (corresponding to the proteins Arg1p, Aah1p, Ade17p, Pgm2p, Cat2p, Cdd1p, Mae1p, Arg3p, Nma2p, Ole1p, Cta1p, Spb1p and Cds1p). Functional annotation analysis of the 92 PGA transcripts identified the following functions: pyrimidine metabolism, steroid biosynthesis, purine metabolism, RNA polymerase and terpenoid backbone biosynthesis. Ergosterol biosynthesis, gluconeogenesis and transcription from Pol I/III promoters were major biological process categories for this set. Interestingly, terpenoids feed into the steroid pathway, which results in the vitamin D2 precursor ergosterol. Analysis of the 43 SV transcripts identified starch and sucrose metabolism, butanoate metabolism, and fructose and mannose metabolism. Stress response was the key biological process for this arm of the study. No functional annotations were statistically significant for the common genes. Transcripts identified by PGA of either the Gpx- or Prx-encoding genes tend toward transcriptional control mechanisms, whereas SV-associated transcripts track with metabolic necessities.
of a disease or on the underlying mechanisms. Many studies have shown that variations in gene expression among individuals, as well as among cell types, contribute to phenotype diversity and disease susceptibility. Recent genome-wide expression quantitative trait loci (eQTL) association (GWEA) studies have provided information on genetic factors, especially SNPs, that are associated with gene expression variation. These expression-associated SNPs (exSNPs) have already been utilized to explain some results of GWAS for diseases, but interpretation of the data is handicapped by low reproducibility of the genotype-expression relationships.
Methods To address this problem, we established several gold standard sets of high-reliability exSNPs based on multiple occurrences in different GWEA studies in various human populations and cell types. We then related these data to results from GWAS for diseases, to find a set of disease-associated loci that are likely to have an underlying expression mechanism. HapMap linkage disequilibrium data were utilized to allow the comparison of GWEA results from studies that employed different microarray SNP sets.
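The replication criterion described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' procedure: the record layout, the study labels, the SNP and gene identifiers and the cutoff of two studies are all hypothetical, and the sketch matches identical SNP identifiers only, whereas the abstract also uses HapMap linkage disequilibrium to relate SNPs from different arrays.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (study label, SNP identifier, gene symbol)


def gold_standard(records: Iterable[Record], min_studies: int = 2) -> Set[Tuple[str, str]]:
    """Keep SNP-gene pairs that are reported by at least `min_studies` distinct studies."""
    studies_by_pair = defaultdict(set)
    for study, snp, gene in records:
        studies_by_pair[(snp, gene)].add(study)
    return {
        pair
        for pair, studies in studies_by_pair.items()
        if len(studies) >= min_studies
    }


if __name__ == "__main__":
    # Entirely made-up example records.
    example = [
        ("study_A", "rs0000001", "GENE1"),
        ("study_B", "rs0000001", "GENE1"),
        ("study_A", "rs0000002", "GENE2"),
        ("study_A", "rs0000002", "GENE2"),  # duplicate within one study counts once
    ]
    print(gold_standard(example))
```

Counting distinct studies rather than raw rows means that a pair reported twice within one study is not mistaken for an independent replication.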
Results We integrated the current gold standard data with SNPs in disease-associated loci from the Wellcome Trust Case-Control Consortium (WTCCC) GWAS of seven common human diseases. Approximately one-third of these disease-associated loci in the WTCCC GWAS were found to be consistent with an underlying expression change mechanism. Comparing separate gold standard sets for Caucasian (CEU), African (YRI) and Asian (ASN) populations also allowed us to investigate which exSNPs contribute to population-specific eQTLs.
Conclusions Use of the gold standard set of SNP-expression relationships has enabled us to more reliably determine the role of expression changes in common human diseases.

Eukarya-specific r-proteins [1]. Despite the high sequence conservation of r-proteins, the annotation of r-protein genes is often difficult because of their short lengths and biased sequence composition.

Methods To perform a comprehensive survey of prokaryotic r-proteins, we developed an automated computational pipeline for the identification of r-protein genes and applied it to 995 completely sequenced bacterial genomes and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, thus overcoming possible gene annotation errors. Likely false positives are identified using comparisons against the original seed alignments.
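Searching six-frame translations means the genome is translated in all three reading frames on both strands, so that a protein match does not depend on existing gene annotations. The sketch below shows only that translation step, under stated assumptions: it uses the standard codon assignments, marks codons containing ambiguous bases as "X", and all function names are illustrative. The PSSM-based BLAST search and the false-positive filtering are not modelled.

```python
from itertools import product

# Standard codon assignments, enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six protein strings could then be used as a search target for a profile built from a seed alignment.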

Results
In the course of this analysis, we gained insight into the diversity of prokaryotic r-protein complements, such as missing and paralogous r-proteins and distributions of r-protein genes among chromosomal partitions. A phylogenetic tree was constructed from a concatenated alignment of 50 almost-ubiquitous bacterial r-proteins. The topology of the tree is generally compatible with the current high-level bacterial taxonomy, although we detected several inconsistencies, possibly indicating uncertain or erroneous classification of the respective bacteria. Similarly, a concatenated alignment of 57 ubiquitous archaeal proteins was used for an archaeal phylogenetic tree reconstruction. In both Bacteria and Archaea, the patterns of the presence/absence of non-ubiquitous r-proteins suggest several independent losses and/or gains of these proteins. According to parsimony reconstruction, three bacterial and five archaeal r-proteins do not appear to be ancestral. Remarkably, all five non-ancestral archaeal r-proteins are present in Eukarya.
Conclusions Extended sets of prokaryotic r-proteins were created. Alignments of these sets may be used as new seed profiles for the identification of r-proteins in new genomes and for comparative genomics studies.

Broad clinical application of ultra-high-throughput sequencing is imminent. In a few notable cases, actionable information has been discovered from sequencing, and the number of such cases is likely to increase. At present, there are no widely accepted genomic standards or quantitative performance metrics. These are needed to achieve the confidence in measurement results that is expected for sound, reproducible research and regulated applications. The National Institute of Standards and Technology (NIST) has been approached about considering development in this area by several commercial entities and regulatory agencies. There is great enthusiasm for translation of sequencing from the research community to clinical practice, and standards that can be used to inform confidence in measurement results (for instance, through validation studies, proficiency testing and routine quality assurance) may be an enabling factor in that goal. NIST is currently gathering input from the genomics community about which reference materials and data would be useful. For example, NIST and the Coriell Institute for Medical Research may develop genomic reference material from cell lines from families that have already been characterized by a variety of sequencing methods (for example, the cell line from which NA12878 DNA is derived). In addition, we may build synthetic DNA constructs to test specific questions about measuring different types of variants or combinations of variants in different genomic contexts. For example, we might create pairs of constructs with single nucleotide polymorphisms, indels and/or structural variants in GC- or AT-rich regions or repeat regions.
To ensure the design of appropriate standards, we are interested in discussing the design and application of genomic reference materials with any interested parties.

Background Protein-protein interactions (PPIs) are the most fundamental biological processes at the molecular level. The experimental methods for testing PPIs are time-consuming and are limited by analogs for many reactions. As a result, a computational model is necessary to predict PPIs and to explore the consequences of signal alterations in biological pathways. Reproductive control of the vector Anopheles gambiae using transgenic techniques poses a serious challenge. To meet this challenge, it would help to define the biological network involving the male accessory gland (MAG) proteins responsible for successful formation of the mating plug [1]. This plug forms in the male and is transferred to the female during mating, hence initiating the PPIs in both sexes. As is the case in Drosophila melanogaster, a close relative of A. gambiae, some MAG proteins responsible for the formation of the mating plug have been shown to alter the post-mating behavior of females.

Methods and results
The STRING database for known PPIs was used to identify orthologs of A. gambiae proteins in Drosophila (Table 1). Twenty-seven proteins are known to form the mating plug in A. gambiae, and 16 others were obtained as strings in the STRING database. Chromosome synteny comparisons for proteins with more than 50% identity between species were carried out using the Artemis Comparison Tool (  they are upregulated in the reproductive tissues of both sexes. To understand the processes involved in plug formation, the Reactome database was used, and the hub proteins were identified in 49 of the 2,021 known processes in Drosophila. Twelve proteins were involved in the following processes: metabolism of proteins (8.8e-13), gene expression (2.0e-06), 3'-UTR-mediated translational regulation (7.7e-08), regulation of β-cell development (1.3e-06), diabetes pathways (6.8e-06), signal recognition (preprolactin) (5.0e-07) and membrane trafficking (1.3e-03). Of the top 50 proteins, 92% had orthologs in A. gambiae, with one identified in the mating plug and four others identified as strings to AGAP009584, which is found in the mating plug. Acp29AB was identified in the network and is known to induce post-mating responses in Drosophila, confirming that the network is reproductive and giving an insight into the possible pathways involved. The CG9083 (Q8SX59) protein was ranked first among the hub proteins but has no ortholog in A. gambiae. Interestingly, it has the same protein properties as the Plugin protein (AGAP009368) in A. gambiae, suggesting that Plugin may be the main protein in the PPI reproductive network in A. gambiae. The Whelan and Goldman (WAG) maximum likelihood tree evaluations of the plug proteins in A. gambiae and their orthologs in Drosophila showed that these proteins are involved in similar biological processes in both species, but the A. gambiae protein evaluation provided a better explanation for the expected process as it clustered in both pre-mated and post-mated PPIs. This

DNA sequence motifs with the ability to form non-B (non-canonical) structures have been linked to a variety of regulatory and pathological processes. Although the exact mechanism is unknown, recent work has provided significant evidence that non-B DNA structures may play a role in DNA instability and mutagenesis, leading to both DNA rearrangements and increased mutational rates, which are hallmarks of cancer. We have developed algorithms to identify a wide variety of non-B-DNA-forming motifs, including G-quadruplex-forming repeats, direct repeats and slipped motifs, inverted repeats and cruciform motifs, mirror repeats and triplex motifs, and A-phased repeats. After identifying these motifs in the mammalian reference genomes of human, mouse, chimpanzee, macaque, cow, dog, rat and platypus, the data were made publicly available in non-B DB [1]. However, it soon became apparent that it was not feasible to annotate the ever-growing list of genomic data and that it would be more effective to provide researchers with a systematic tool to predict these motifs in their own genomic data. Thus, the non-B DNA Motif Search Tool (nBMST) was created, and it is freely available online [2]. nBMST is a web interface that enables researchers to interactively submit any DNA sequence for searching for non-B DNA motifs. Once a user submits one or more DNA sequences in FASTA format, nBMST returns a comprehensive results page that contains the following: downloadable files in both a tab-delimited format and a generic feature format (GFF); a visualization, including PNG images; and a dynamic genome browser created using the Generic Genome Browser (GBrowse) [3] (version 2.0). Currently, nBMST allows file sizes of up to 20 MB of DNA sequence to be uploaded and stores the results for registered users for up to six months.
In summary, the purpose of nBMST is to help provide insight into the involvement of alternative DNA conformations in cancer and other diseases, as well as into other potential biological functions.
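To make one of the motif classes concrete, the following minimal sketch scans a sequence for inverted repeats, that is, an arm followed after a short spacer by its own reverse complement, which is the arrangement that can in principle fold into a cruciform. This is not the nBMST algorithm: the function name, the fixed arm length and the spacer limit are assumptions chosen for illustration, and the criteria actually used by nBMST are not given in the abstract.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm=6, max_spacer=4):
    """Return (start, arm, spacer) for each arm whose reverse complement
    occurs again after at most `max_spacer` intervening bases.

    `arm` and `max_spacer` are hypothetical defaults for this sketch.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm + 1):
        target = reverse_complement(seq[start:start + arm])
        for spacer in range(max_spacer + 1):
            right = start + arm + spacer
            if seq[right:right + arm] == target:
                hits.append((start, arm, spacer))
    return hits


# Toy sequence: arm GGATCA, spacer TTT, then the reverse complement TGATCC.
hits = find_inverted_repeats("AAGGATCATTTTGATCCAA")
```

Each hit could be written out as one tab-delimited or GFF-style record (sequence name, start, end, motif type) in a fuller tool.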
to date, data generated from GWAS have not been maximally leveraged and integrated with gene expression data to identify the genes and pathways associated with the most aggressive subset of breast cancers, triple-negative breast cancer (TNBC), which accounts for about 20% of all breast cancers. TNBC disproportionately affects young premenopausal women and has a higher mortality rate among African-American women. At present, no targeted treatments exist for TNBC, and standard chemotherapy remains the only therapeutic option. Integration of genetic mapping results from GWAS with gene expression data could lead to a better understanding of the genetic mechanisms underlying the molecular basis of the TNBC phenotype and to the identification of potential biomarkers for the development of novel therapeutic strategies.

Methods We mined data from 43 GWAS involving over 250,000 patients with breast cancer and 250,000 controls, reported through April 2011, to identify genetic variants (single nucleotide polymorphisms (SNPs)) and genes associated with risk for breast cancer. We then integrated GWAS information with gene expression data from 305 subjects (162 cases and 143 controls) to stratify TNBC and other breast cancer subtypes, as well as to identify functionally related genes and multi-gene pathways enriched by SNPs that are associated with risk for breast cancer and are relevant to TNBC. To stratify TNBC and to identify functionally related genes, we performed supervised and unsupervised analysis of gene expression data. We used a false discovery rate to correct for multiple testing. Pathway prediction and network visualization were performed using Ingenuity Systems' software.
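The abstract does not say which false discovery rate procedure was used. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg step-up adjustment, which is an assumption here rather than the authors' stated method. It converts a list of per-gene p-values into adjusted values that can be compared against a chosen FDR level.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    The adjusted value for the p-value of rank r (1 = smallest) among m tests
    is min over ranks >= r of p * m / rank, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Genes whose adjusted value falls below the chosen level (for example 0.05) would be retained as significant.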
Results Combining GWAS information with gene expression data, we identified 448 functionally related genes that stratified breast cancer subtypes into TNBC. A subset of these genes (130 genes) contained SNPs associated with risk for breast cancer; of these 130 genes, 122 correctly stratified TNBC. Pathway prediction revealed multi-gene pathways enriched by SNPs that are significantly associated with risk for breast cancer. Key pathways identified include the p53, nuclear factor-κB, DNA repair and cell cycle regulation pathways.
Conclusions Our results demonstrate that integrating GWAS information with gene expression data can be an effective approach for identifying biological pathways that are relevant to TNBC. These could be potential targets for the development of novel therapeutic strategies.

P36
Abstract not submitted for online publication.

The clinical reality of the post-genomic era is that we now face even more complex disease processes when provided with genomic information, including multifactorial genetic and genomic influences, and epigenetic and environmental factors. A useful example of the promise and perils of genomic technologies and information is breast cancer. By the mid-1990s, two genes (BRCA1 and BRCA2) had been identified, accounting for approximately 5% of affected individuals. Since then, surprisingly few genetic breast cancer risk factors have been identified to account for the remaining 95%. To efficiently and cost-effectively identify individuals at high risk, a combination of information components is required: a patient-reported personal and family medical history; clinical data (for example, a physical exam, pathology results, laboratory test results and imaging); and genetic/genomic results. Gaining comprehensive data from all of these areas provides the best risk assessment and management options for patients. Furthermore, high quality patient and clinical information is essential for the accurate and reliable interpretation of genomic results. We have clinically implemented a platform that integrates all three informational components with multiple risk estimation models (REMs) to produce an effective automated method for risk-stratifying patients. Although this platform can be and has been applied to a wide range of genetic conditions, this presentation will use breast cancer to illustrate the approach. This system consists of three primary components: a secure

The new and emerging field of systems medicine, an application of systems biology approaches to biomedical problems in the clinical setting, leverages complex computational tools and high dimensional data to derive personalized assessments of disease risk.
Systems medicine offers the potential for more effective individualized diagnosis, prognosis and treatment options. The Georgetown Clinical & Omics Development Engine (G-CODE) is a generic and flexible web-based platform that serves to allow basic, translational and clinical research activities by integrating patient characteristics and clinical outcome data with a variety of high-throughput research data in a unified environment to enable systems medicine. Through this modular, extensible and flexible infrastructure, we can quickly and easily assemble new translational web applications with both analytic and generic administrative features. New analytic functionalities specific to the needs of a particular disease community can easily be added within this modular architecture. With G-CODE, we hope to help enable the creation of new disease-centric portals, as well as the widespread use of biomedical informatics tools by basic, clinical and translational researchers, through providing powerful analytic tools and capabilities within easy-to-use interfaces that can be customized to the needs of each research community. This infrastructure was first deployed in the form of the Georgetown Database of Cancer (G-DOC) [1], which includes a broad collection of bioinformatics and systems biology tools for analysis and visualization of four major omics types: DNA, mRNA, microRNA and metabolites. Although several rich data repositories for high dimensional research data exist in the public domain, most focus on a single data type and do not support integration across multiple technologies. G-DOC contains data for more than 2,500 patients with breast cancer and almost 800 patients with gastrointestinal cancer, all of which are handled in a manner that allows maximum integration.
We believe that G-DOC will help facilitate systems medicine by allowing easy identification of trends and patterns in integrated datasets and will hence facilitate the use of better targeted therapies for cancer. One obvious area for expansion of the G-CODE/G-DOC platform infrastructure is to support next-generation sequencing (NGS), which is a highly enabling and transformative emerging technology for the biomedical sciences. Nonetheless, effective utilization of these data is impeded by the substantial handling, manipulation and analysis requirements that are entailed. We have concluded that cloud computing is well positioned to fill these gaps, as this type of infrastructure permits rapid scaling with low input costs. As such, the Georgetown University team is exploring the use of the Amazon EC2 cloud and the Galaxy platform to process whole exome, whole genome, RNA-Seq and chromatin immunoprecipitation (ChIP)-Seq NGS data. The processed NGS data will be integrated into G-DOC to ensure that they can be analyzed in the full context of other omics data. Likewise, all G-CODE projects will simultaneously benefit from these advances in NGS data handling. Through technology re-use, the G-CODE infrastructure will accelerate progress in a variety of ongoing programs that are in need of integrative multi-omics analysis and will advance our opportunities to practice effective systems medicine in the near future.

Background In this work, we study the benefits of using optical maps to improve genome assembly. Many modern assembly algorithms rely on a de Bruijn graph paradigm to reconstruct a genome from short reads. Ambiguities caused by repeats within the genome cause the final assembly to be broken up into many contigs, because the assembler does not have enough information to find the one correct traversal of the graph.
Optical mapping technology can be useful for determining the correct path in the de Bruijn graph, through providing estimates on the locations of one or more restriction enzyme patterns in the genome, thereby constraining the possible traversals of the graph to only those that are consistent with the map. A particular traversal that does not align well with the optical map can be discarded as incorrect. Previous work has shown how to construct optical maps [1,2] for scaffolding contigs [3].

Methods Our algorithm relies on a depth-first search strategy. As the depth-first search proceeds and its corresponding sequence is extended, we check whether the resultant sequence would generate an optical map that matches the optical map of the genome. If the candidate in silico optical map matches the optical map of the genome, we proceed with the depth-first search. Otherwise, we backtrack in the depth-first search until we find a path that covers the entire graph and whose sequence has an optical map that matches the optical map of the entire genome. Although the total number of paths in the de Bruijn graph can be exponential in the number of nodes and edges in the graph [4], a reference optical map can effectively prune the search space of paths. To improve performance, we start by finding edges in the de Bruijn graph that can be uniquely placed on the optical map. These edges, which we call landmark edges, can also help guide our depth-first search. Although there may be multiple paths in the de Bruijn graph that can yield sequences with optical maps that match the genome's optical map, these paths all yield very similar sequences in most cases.
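The search described above can be sketched on a toy graph. This is a simplified illustration under several assumptions, not the authors' implementation: edge labels are plain strings that are concatenated (the k-1 overlap of a real de Bruijn graph is taken to be already trimmed), the optical map is an exact list of fragment lengths for a single restriction site with an optional tolerance `tol`, and landmark edges are not modelled. All names are hypothetical.

```python
def fragment_lengths(seq, site):
    """In silico optical map: lengths between consecutive restriction sites."""
    cuts, pos = [0], seq.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(site, pos + 1)
    cuts.append(len(seq))
    return [b - a for a, b in zip(cuts, cuts[1:])]


def consistent_prefix(seq, site, ref_map, tol):
    """Can this partial sequence still be extended to match ref_map?"""
    frags = fragment_lengths(seq, site)
    closed, open_len = frags[:-1], frags[-1]
    if len(closed) >= len(ref_map):
        return False
    if any(abs(f - r) > tol for f, r in zip(closed, ref_map)):
        return False
    # A site spanning the next edge boundary may still shorten the open fragment.
    return open_len - (len(site) - 1) <= ref_map[len(closed)] + tol


def assemble(graph, start, site, ref_map, tol=0):
    """Depth-first search for a path using every edge once whose map fits ref_map.

    graph maps a node to a list of (next_node, edge_sequence) tuples.
    """
    total_edges = sum(len(out) for out in graph.values())
    used = set()

    def dfs(node, seq):
        if len(used) == total_edges:
            frags = fragment_lengths(seq, site)
            ok = len(frags) == len(ref_map) and all(
                abs(f - r) <= tol for f, r in zip(frags, ref_map))
            return seq if ok else None
        for idx, (nxt, label) in enumerate(graph.get(node, [])):
            if (node, idx) in used:
                continue
            candidate = seq + label
            if not consistent_prefix(candidate, site, ref_map, tol):
                continue  # prune: partial map already disagrees with the reference
            used.add((node, idx))
            result = dfs(nxt, candidate)
            if result is not None:
                return result
            used.remove((node, idx))  # backtrack
        return None

    return dfs(start, "")


# Toy repeat: two loops at node R can be traversed in either order.
P, L1, L2, X = "AAAA", "GAATTCCC", "TTGAATTCGGGG", "AAGAATTCA"
graph = {"S": [("R", P)], "R": [("R", L1), ("R", L2), ("E", X)]}
result = assemble(graph, "S", "GAATTC", ref_map=[6, 10, 10, 7])
```

In this toy case the first loop order is rejected as soon as its first closed fragment (length 4) disagrees with the reference (length 6), so only the consistent ordering is explored to completion.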

An amalgamated risk estimation model (REM) and assay integration into future REMs
Results Given modest assumptions about the errors in the optical map, initial simulations show that our algorithm is very effective at assembling bacterial genomes, given read lengths of 100 or longer. The majority of our assemblies match the original sequences used in our simulations very closely. We will also present the results of simulations aimed at measuring the effect of errors on the correctness of the reconstruction and at measuring how the choice of restriction enzymes can improve the sequence assembly.

Conclusions Our work shows that optical maps can be used effectively to aid in genome assembly. We are currently extending our approach to handle much larger graphs and to tolerate higher amounts of mapping error. In our final assembly, we would also like to be able to detect and mark regions that we are less certain about and regions that we are confident are correct.
define functional diversity in comparison to organismal ecology, including an example of microbial metabolism linked to specific organisms and to host phenotype (vaginal pH) in the posterior fornix. We provide profiles of 168 functional modules and 196 metabolic pathways that were determined to be specific to one or more niches within the human microbiome, including details of glycosaminoglycan degradation in the gut. Understanding how and why these biomolecular activities differ among environmental conditions or disease phenotypes is, more broadly, one of the central questions addressed by high-throughput biology. We have thus developed the linear discriminant analysis (LDA) effect size algorithm (LEfSe) to discover and explain microbial and functional biomarkers in the human microbiota and other microbiomes. We demonstrate this method to be effective for mining human microbiomes for metagenomic biomarkers associated with mucosal tissues and with different levels of oxygen availability. Similarly, when applied to 16S rRNA gene data from a murine ulcerative colitis gut community, LEfSe confirms the key role played by Bifidobacterium in this disease and suggests the involvement of additional clades, including the Clostridia and Metascardovia. A quantitative validation of LEfSe highlights a lower false positive rate, consistent ranking of biomarker relevance, and concise representations of taxonomic and functional shifts in microbial communities associated with environmental conditions or disease phenotypes. Implementations of both methodologies are available at the Huttenhower laboratory's website [1,2]. Together, they provide a way to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads and, subsequently, to identify organisms, genes or pathways that consistently explain the differences between two or more microbial communities.
This has allowed the determination of community roles in the HMP cohort, as well as their niche and population specificity, which we anticipate will be applicable to future metagenomic studies.

High-throughput sequencing (HTS) is an emerging technology that promises to deliver unparalleled information on genomic variations. As technology evolves and matures, and as a deeper understanding of this technology is gained, new and upgraded tools for analyzing HTS will become available and will need to be evaluated and validated. To facilitate this cumbersome task, we have developed an HTS validation framework into which both in-house-generated synthetic datasets and well-characterized experimental datasets have been incorporated for controlled testing and evaluation of these analysis tools. Currently, the framework can be used to assess algorithms for short-read mapping, variant calling and RNA-Seq-derived gene expression measurements. The framework is deployed in the Amazon EC2 cloud so that it is available to the broader research community. Using our framework, researchers can further validate interfaced applications with preferred parameters, upload their own datasets for processing, and interface new applications with the framework for validation and comparison. We report the performance of several alignment, variant calling and RNA-Seq analytic tools that have been tested with our framework. We also provide feedback on the challenges and benefits of Amazon EC2 deployment.
Cite abstracts in this supplement using the relevant abstract number, e.g.: Liu X, et al.: A high-throughput-sequence analysis infrastructure technology investigation framework for the evaluation of next-generation sequencing software. Genome Biology 2011, 12(Suppl 1):P48.