REAPR: a universal tool for genome assembly evaluation
© Hunt et al.; licensee BioMed Central Ltd. 2013
Received: 7 March 2013
Accepted: 27 May 2013
Published: 27 May 2013
Methods to reliably assess the accuracy of genome sequence data are lacking. Currently completeness is only described qualitatively and mis-assemblies are overlooked. Here we present REAPR, a tool that precisely identifies errors in genome assemblies without the need for a reference sequence. We have validated REAPR on complete genomes or de novo assemblies from bacteria, malaria and Caenorhabditis elegans, and demonstrate that 86% and 82% of the human and mouse reference genomes are error-free, respectively. When applied to an ongoing genome project, REAPR provides corrected assembly statistics allowing the quantitative comparison of multiple assemblies. REAPR is available at http://www.sanger.ac.uk/resources/software/reapr/.
Keywords: Genome assembly, validation, evaluation
The volume of genome sequence data continues to increase exponentially yet methods that reliably assess the quality of assembled sequence are lacking. In an attempt to categorise the quality of genome assemblies, Chain et al. [1] proposed a series of qualitative descriptions. Although these serve as a useful guide, they do not provide statistical or numerical comparisons of data quality apart from the extreme case of a 'finished' sequence. The recent advent of so-called next generation sequencing (NGS) has seen a dramatic increase in the rate of production of new genome sequences, with a growing proportion of genome projects classified as 'permanent draft' [2]. Moreover, most published assemblies do not get classified but are in fact also of 'draft' quality [3], which is the least accurate of all the categories. Relatively few reference genomes undergo continuous and rigorous quality improvement to repair errors. Two notable exceptions are the human genome [4] and the Plasmodium falciparum genome [5], where versioned error correction allows the comparison of sequence improvements over time. The reliability of reference sequence data is crucial for the interpretation of downstream functional genomic analysis and thus a metric indicating the genome-wide accuracy of the reference sequence is essential.
Over 35 different tools ('assemblers') are available to perform de novo genome assembly [6]. The assembly of the short reads produced by NGS technology is, however, known to be problematic [7, 8], despite the high coverage and range of insert sizes available. The precise behaviour of assemblers on a given genome is hard to predict without prior knowledge of its base composition, size, repetitive sequences and levels of polymorphism. Often the solution is to run assemblies with multiple tools or parameters and pick the best one based on summary statistics. Frequently, contig or scaffold N50 sizes are reported (the contig/scaffold size above which half the genome is represented). These are supposed to indicate contiguity (and certainly not accuracy), but the frequent inclusion of incorrectly joined sequences provides a false boost to N50s despite reducing the accuracy of the genome consensus sequence. A better approach is to make a more informed decision on the best assembly by considering the real contiguity together with the errors in each assembly. Recent assembler evaluations GAGE [9] and Assemblathon 1 [10] highlighted the variability in performance of assemblers when given different input data or when changing their parameters. However, studies such as these require a known reference genome in order to assess the assemblies, a luxury that is unavailable when producing a de novo assembly.
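The N50 statistic and its sensitivity to false joins can be illustrated with a minimal sketch. The function name `n50` and the toy contig lengths are assumptions made for illustration only; they are not taken from any assembler or from REAPR.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: the same 100 bases, assembled correctly and with one false join.
correct = [40, 30, 20, 10]
misjoined = [70, 20, 10]  # the 40 and 30 contigs were wrongly joined

print(n50(correct))    # N50 of the correct assembly
print(n50(misjoined))  # N50 after the false join
```

In this toy case the false join raises the N50 even though the assembly has become less accurate, which is the effect described above.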
The development of genome assembly analysis tools that do not require the use of a reference sequence for comparison is currently an active area of research, with a few tools already available. All of these tools use the position of read pairs within an assembly to perform their analysis. Amosvalidate [11] was developed before the introduction of NGS, requires a file format produced by few assemblers and does not scale well to the large volumes of data typified by modern genome projects. Subsequent tools were recently introduced to work with NGS, all of which analyse assemblies using remapped reads and are effective at determining the best assembly from a set of assemblies of the same data. CGAL [12] and ALE [13] both produce a summary likelihood score of an assembly, with ALE also reporting four likelihood scores for each base. FRCbam [14] uses many metrics to identify 'features', which correspond to erroneous regions of an assembly and are used to plot a feature response curve [15]. The best assembly can be determined by overlaying these curves.
However, all of these tools lack the crucial ability to transform metrics into accurate error calls, or to report a single score for each base that defines whether the assembly is correct or wrong at any given position. Therefore we developed a reference-free algorithm (REAPR - Recognition of Errors in Assemblies using Paired Reads), applicable to large genomes and NGS data, with two principal aims: to score every base for accuracy and to automatically pinpoint mis-assemblies. The output is intended to be as useful and informative as possible to the end-user and includes the bases identified as 'error-free' (see later for a definition), the location of assembly errors, and a new assembly that has been broken at points of assembly error. This information allows the N50 to be recalculated into the corrected N50 metric, similarly to previous studies that required a reference sequence [9, 10]. Thus, the combination of the number of error-free bases and the corrected N50 can now provide an effective summary of any genome assembly.
Results and discussion
Overview of the REAPR pipeline
Since a read cannot map to a sequencing gap (a region of ambiguous bases, or Ns), the theoretical FCD changes in the presence of a gap and a correction is applied to the FCD error calculation (Figure 1b(v), Additional file 1, Section 2.3), enabling the identification of scaffolding errors. In this way, REAPR scans along the entire genome, constructing the FCD at each base (Additional file 2), calculating the FCD error and identifying mis-assemblies.
In order to measure local accuracy REAPR uses proper read pairs that map to just one position of the assembly, with their entire length matching perfectly, to generate the read depth at every base of the assembly. By default, a given base is designated as locally error-free if it has at least five such reads aligned to it, but this is a parameter that can be changed by the user.
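As a rough illustration of the local accuracy test, the sketch below builds per-base depth from read intervals and applies the default threshold of five reads. It assumes the reads have already been filtered to perfectly and uniquely mapped proper pairs; the function names and the half-open, 0-based interval convention are hypothetical choices for this example, not REAPR's internal representation.

```python
def per_base_depth(length, intervals):
    """Depth at each base of a sequence of the given length.

    `intervals` holds (start, end) pairs, 0-based and half-open, one per
    read that is assumed to be perfectly and uniquely mapped.
    """
    diff = [0] * (length + 1)
    for start, end in intervals:
        start = min(max(start, 0), length)
        end = min(max(end, 0), length)
        if end > start:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def locally_error_free(depth, min_depth=5):
    """True for each base covered by at least `min_depth` such reads."""
    return [d >= min_depth for d in depth]


reads = [(0, 4)] * 5 + [(2, 6)]
depth = per_base_depth(6, reads)
print(depth)
print(locally_error_free(depth))
```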
REAPR keeps track of several other metrics at every base of the genome. In terms of mis-assemblies, the most important of these is the fragment coverage, where a value of zero returns an error. If it is non-zero then the value of the FCD error is taken into account. Any region that has no fragment depth, or has a fragment distribution around a base that causes an FCD error, is reported as a mis-assembly. If this region contains a gap then it is likely to have arisen because two contigs have been falsely joined by read pairs, which we term a scaffolding error; otherwise it is simply an error in the assembled block of sequence, which we term a contig error. In short, an assembly error call is triggered by either a lack of, or irregular, fragment coverage.
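The decision logic in this paragraph can be summarised in a few lines. This is only a sketch under stated assumptions: `fcd_cutoff` stands in for whatever threshold defines an unacceptable FCD error, and a gap is taken to be any run of N characters in the region; neither the function names nor the data layout come from REAPR itself.

```python
def base_fails(fragment_coverage, fcd_error, fcd_cutoff):
    """A base fails if it has no fragment coverage or an FCD error above the cutoff."""
    if fragment_coverage == 0:
        return True
    return fcd_error > fcd_cutoff


def classify_error(sequence, start, end):
    """Label a failing region (0-based, half-open) by whether it contains a gap."""
    region = sequence[start:end].upper()
    return "scaffolding error" if "N" in region else "contig error"


scaffold = "ACGTACGTNNNNACGTACGT"
print(classify_error(scaffold, 6, 14))  # region spans the run of Ns
print(classify_error(scaffold, 0, 6))   # region lies inside a contig
```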
REAPR also outputs a warning for each of the following types of less serious inconsistencies in the assembly. A small deletion or insertion error often causes reads to be 'soft-clipped' (that is, some terminal bases ignored) in order for them to align to the assembly at the position of the error (see Additional file 1, Figure S2). Regions within an assembly where reads mapped in the wrong orientation, or as singletons, can aid in accurately determining the position of an FCD error caused by a scaffolding error or an incorrect assembly of a repetitive sequence. The latter pose a major challenge to assemblers, often resulting in collapsed repeats assembled into fewer copies than exist in the real genome. A region is flagged as a repeat by REAPR if the observed coverage is more than twice the expected coverage, after correcting for any GC bias present in the reads mapped to the assembly (Additional file 1 Figure S3d).
Scoring each base of the assembly
REAPR assigns a score to every base of the assembly, with priority given to the perfect and unique read-pair coverage and the FCD error over other metrics. A given base is considered to be error-free, scoring one, if its FCD error is sufficiently small (see online Methods) and it is locally error-free (based on perfectly and uniquely mapped read depth, as defined above). This combination captures both the local accuracy and the presence of larger scale errors in an assembly, so that error-free bases represent the regions of the assembly that are extremely likely to be correct. Otherwise a score from zero to one is assigned, based on the number of other metrics that fall outside acceptable limits, with zero being the worst score. Briefly, the metrics used are the read depth and type of paired mapping, such as orphaned reads or reads in the wrong orientation, fragment depth and the presence of soft clipping (see online Methods for full details).
Analysis of reference genomes
Table: A summary of REAPR results on a range of genome sequences. Columns: Total length (Mb); Total gap length; Original N50 (Mb); Called by REAPR; Error-free bases (%). Rows: S. aureus TW20 k71; S. aureus, GAGE Velvet; P. falciparum de novo k55; P. falciparum v2.1.4; P. falciparum v3; C. elegans WS228; M. musculus GRCm38; H. sapiens GRCh37.
Next we applied REAPR to the C. elegans reference genome using a large insert size library that was derived from whole genome amplified (WGA) DNA. Ninety percent of the genome was reported to be error-free. The FCD error metric flagged up 842 errors, with manual analysis revealing that many of these error calls were caused by extremely uneven coverage across the genome. This unevenness was presumably a result of the WGA step used in the sequencing protocol (Additional file 1, Figure S5). However, the 20 regions with the largest FCD error were chosen for further analysis by PCR (Additional file 1, Figure S6, Table S6). Of the eight loci we were able to amplify, seven had a different size (>1.5 kb) from that predicted by the reference genome. Therefore REAPR successfully identified these regions as incorrect in the reference genome.
REAPR also scales to the human and mouse genomes, requiring less memory and CPU time than the mapping step (Additional file 1, Table S7). Ignoring sequencing gaps, we found 86% and 82% of bases to be error-free in the reference genomes of H. sapiens and M. musculus, respectively.
Application to de novo assemblies
We finally tested REAPR's applicability to a more challenging genome project by applying it to a de novo assembly of P. falciparum, which contained 11,636 sequencing gaps. In this case 55 scaffolding errors, again manually verified, were correctly identified with only one false-positive reported (Additional file 1, Table S10).
It should be noted that the ability of REAPR to detect errors is inherently limited by aspects of the sequencing technology such as insert size and read length meaning that some assembly errors remain unreported (see Additional file 1 for a full explanation). Further it should also be noted that assemblies of diploid (or polyploid) genomes still present a considerable challenge. Depending on the divergence between haplotypes, sequences may assemble separately or merge together. REAPR will call errors at the boundaries of regions where sequence-coverage differs, such as the boundary between merged and separated haplotypes. However, fully testing this functionality remains an area for future development alongside the development of assembly technologies that allow the sequences of homologous chromosomes to be assembled independently.
Corrected assembly statistics
Therefore, when applied to each of a series of de novo assemblies, REAPR arms the user with a robust method of comparing the output of different assemblers, so that the best assembly can be chosen for publication using standard but corrected metrics. To demonstrate this we applied REAPR to an ongoing genome project on the nematode Brugia pahangi. Figure 4c compares the progress of the assembly when monitored by standard N50 and REAPR corrected statistics at different steps of the improvement pipeline. Although the N50 itself does not increase at each stage, the corrected N50 shows a consistent increase and we see that genuine improvements have been made to the assembly.
Here we have described the first algorithm that translates per-base metrics into error calls of reference sequences and de novo assemblies using NGS data. Establishing the quality of those sequences will become increasingly important as the assembly process shifts to more automated methods . For example, REAPR correctly identified the ALLPATHS assembly to be the best of the GAGE S. aureus assemblies, without using a reference sequence. This assembly had the fewest error calls, the greatest number of error-free bases and the fewest warnings reported by REAPR (Additional file 1, Tables S1-3). Therefore we propose that REAPR should be applied to all genome projects prior to computing standard contiguity statistics (such as the N50). In this way the quality of assemblies and performance of assemblers can be compared robustly via a method that produces metrics that are constant between methodologies or datasets. By also providing a per base value for the accuracy of a sequence, that can be easily overlaid and viewed by the end-user, different genomes or assembly versions can be accurately compared and downstream analysis enhanced by enabling the end-user to be aware of regions of questionable accuracy.
Materials and methods
The read mapper SMALT [21] was used in all examples to map sequencing reads to assemblies. The entire command lines used are given in Additional file 1, but we note that the -x option was always used, so that each read in a mate pair was independently mapped, thereby avoiding the false placement of a read near to its mate instead of elsewhere with a better alignment. The -r option was also always used to randomly place reads which map repetitively, to prevent all repetitive regions of the reference sequence from having zero read coverage. After mapping, duplicate read-pairs were marked using the MarkDuplicates function of Picard version 1.47 [22].
The assembly analysis algorithm was implemented in a tool called REAPR: 'recognition of errors in assembly using paired reads'. The pipeline is simple to run, requiring as input an assembly in FASTA format and read pairs in FASTQ format. Alternatively, the user can map the reads to the assembly and provide a BAM file [23]. The steps in the pipeline are outlined in Figure 1 and described below (see Additional file 1 for full details of each stage).
Initially, input to the REAPR pipeline must be generated, starting with the unique and perfectly aligned read coverage of a high quality set of paired reads. For small genomes (<100 MB), this is calculated using the extremely fast but high memory tool SNP-o-matic [24]. For large genomes, the coverage is extracted from a BAM file of reads mapped using SMALT. This perfect and unique mapping information, together with a BAM file of the larger insert size reads mapped to the genome, is used as input to the REAPR pipeline. REAPR version 1.0.11 was used in all cases, with the default parameters.
The pipeline begins with a pre-processing step that estimates various statistics, such as average fragment length and depth of coverage, using a sample of the genome. In particular, GC bias is accounted for by calculating the expected fragment coverage at any given value of GC content. This correction to the fragment coverage is applied in subsequent stages of the pipeline. The method used is to take a LOWESS line through a scatter plot of fragment coverage versus GC content (see Additional file 1, Figure S3d).
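To make the idea of GC-dependent expected coverage concrete, the sketch below tabulates mean fragment coverage per GC percentage from sampled windows. Note that this deliberately swaps the LOWESS fit described above for a much cruder per-bin mean, purely to keep the example dependency-free; the function names and the toy windows are assumptions for illustration.

```python
from collections import defaultdict


def gc_percent(seq):
    """GC content of a window as a rounded percentage, ignoring non-ACGT bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return round(100 * (seq.count("G") + seq.count("C")) / acgt)


def expected_coverage_by_gc(windows):
    """Mean fragment coverage for each GC percentage.

    `windows` is an iterable of (sequence, mean_fragment_coverage) pairs
    sampled from the assembly. A per-bin mean is a simplification used
    here in place of a LOWESS line through the scatter plot.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, coverage in windows:
        gc = gc_percent(seq)
        if gc is None:
            continue  # window is entirely gap or ambiguous bases
        sums[gc] += coverage
        counts[gc] += 1
    return {gc: sums[gc] / counts[gc] for gc in sums}


sampled = [("AAAA", 10), ("AATT", 14), ("GGCC", 30), ("ACGT", 20)]
print(expected_coverage_by_gc(sampled))
```

Observed coverage at a base could then be compared with the table entry for the GC content of its surrounding window, which is the role the fitted line plays in later stages.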
The next stage calculates statistics at each base of the assembly, using the information in the input BAM file and the perfect and uniquely mapped read depth. These statistics are used to call errors in the assembly and to score each base of the assembly. We shall use 'inner fragment' to mean the inner mate pair distance or, equivalently, a fragment without including the reads (see Additional file 1 Figure S2a). The metrics calculated are read depth and type of read coverage, inner fragment coverage, error in inner fragment coverage (corrected for GC content), FCD error and amount of soft clipping. The metrics are explained in more detail below and in Additional file 1.
Recall that the FCD error at each base of an assembly is taken to be the area between the observed and ideal fragment coverage distributions (see Figure 1c). It is normalized for both fragment depth and mean insert size so that results are comparable for data from different libraries. A correction is made for the presence of the nearest gap, if it lies within one insert size of the base of interest (see Additional file 1). If a base has zero fragment coverage then this metric cannot be used and the assumption is that the assembly is incorrect. The exception to this is where a gap has length longer than half the average insert size, in which case it is impossible to determine if this scaffolding is correct and therefore no further analysis is performed.
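A minimal sketch of an area-between-curves calculation is given below. It assumes the observed and ideal distributions have already been sampled at the same positions around the base of interest, and it normalises by dividing by the product of fragment depth and mean insert size; that particular normalisation, the function name and the toy curves are assumptions, and the gap correction is not modelled.

```python
def fcd_error(observed, ideal, fragment_depth, mean_insert_size):
    """Normalised area between observed and ideal fragment coverage curves.

    Returns None when there is no fragment coverage, since the metric
    cannot be computed in that case.
    """
    if len(observed) != len(ideal):
        raise ValueError("curves must be sampled at the same positions")
    if fragment_depth == 0:
        return None
    # Sum of absolute differences approximates the area between the curves.
    area = sum(abs(o - i) for o, i in zip(observed, ideal))
    # Assumed normalisation: divide by depth and by mean insert size.
    return area / (fragment_depth * mean_insert_size)


ideal = [0, 2, 4, 2, 0]
print(fcd_error([0, 2, 4, 2, 0], ideal, 4, 5))  # identical curves
print(fcd_error([0, 4, 4, 4, 0], ideal, 4, 5))  # flattened observed curve
```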
In addition to the absolute count of read coverage, the type of read coverage is considered. At each base, and for each strand, the proportion of reads of the following types is calculated: proper read pairs, defined to be in the correct orientation and insert size, which should be in the majority if the genome is correct; orphaned reads, whereby a read's mate is either unmapped or mapped to a different chromosome; reads with the correct orientation but wrong insert size; and read pairs with an incorrect orientation.
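The four categories can be expressed as a small classifier. The sketch assumes each read has been reduced to a handful of flags plus an insert size, and that `min_insert` and `max_insert` bound the acceptable insert size; these names and the label strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def classify_read(mate_mapped, same_sequence, correct_orientation,
                  insert_size, min_insert, max_insert):
    """Assign a read to one of the four coverage types described above."""
    if not mate_mapped or not same_sequence:
        return "orphan"
    if not correct_orientation:
        return "wrong_orientation"
    if min_insert <= insert_size <= max_insert:
        return "proper"
    return "wrong_insert_size"


def type_proportions(labels):
    """Proportion of each read type among the reads covering one base."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


labels = [
    classify_read(True, True, True, 3000, 2000, 4000),
    classify_read(True, True, True, 3100, 2000, 4000),
    classify_read(True, True, True, 2900, 2000, 4000),
    classify_read(False, True, True, 0, 2000, 4000),
]
print(type_proportions(labels))
```

In practice the proportions would be computed separately for each strand at each base, as the text describes.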
Most read mapping tools are capable of soft-clipping reads, where most of a read is aligned to the genome, but a few bases at either end of the read do not match. In this case the read is still reported as mapped, but the mismatching bases are not considered as part of the alignment and designated as soft-clipped (Additional file 1, Figure S2c). At each base, the number of alignments is counted that start or end at that base due to a soft-clipped read.
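One way to count such positions from mapped reads is to inspect the CIGAR string of each alignment, as sketched below. The sketch assumes SAM conventions (a 1-based leftmost aligned position and soft clips encoded as `S` operations); the function names are hypothetical and this is not REAPR's own code.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def soft_clip_positions(pos, cigar):
    """Reference positions (1-based) where an alignment starts or ends at a soft clip."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # ignore hard clips
    if not ops:
        return []
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    hits = []
    if ops[0][1] == "S":
        hits.append(pos)  # alignment starts here because of a leading clip
    if len(ops) > 1 and ops[-1][1] == "S" and ref_len > 0:
        hits.append(pos + ref_len - 1)  # alignment ends here because of a trailing clip
    return hits


def count_soft_clips(alignments):
    """Tally soft-clip start/end events per reference position."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(soft_clip_positions(pos, cigar))
    return counts


alignments = [(100, "5S45M"), (56, "45M5S"), (100, "50M")]
print(count_soft_clips(alignments))
```

A pile-up of such events at one position, as in this toy input, is the kind of signal that would point to a small insertion or deletion error.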
In order to call assembly errors from a given metric, a minimum window length and appropriate minimum and maximum values of that metric are chosen. Any region of length no smaller than the window length and with at least 80% of the bases falling outside the acceptable range is reported. For example, a collapsed repeat is called if the relative error in fragment coverage is at least two for 80% of the bases in a stretch of at least 100 bp. The default choice of parameter for each metric is described in Additional file 1. In the actual implementation, the user can choose all parameters.
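The window rule can be sketched as a sliding-window scan. This is one possible reading of the rule, in which every window of the minimum length is tested and overlapping failing windows are merged into a single reported region; the function name, the merging step and the 0-based half-open coordinates are assumptions.

```python
def call_regions(values, min_value, max_value, window=100, min_fraction=0.8):
    """Report regions where most bases fall outside [min_value, max_value].

    A window of `window` bases is flagged when at least `min_fraction` of
    its bases are outside the acceptable range; overlapping or adjacent
    flagged windows are merged. Regions are (start, end), 0-based, half-open.
    """
    n = len(values)
    if n < window:
        return []
    bad = [0 if min_value <= v <= max_value else 1 for v in values]
    needed = min_fraction * window - 1e-9  # small tolerance for float rounding
    count = sum(bad[:window])
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            count += bad[start + window - 1] - bad[start - 1]
        if count >= needed:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


metric = [1.0] * 10 + [3.0] * 10 + [1.0] * 10
print(call_regions(metric, 0.0, 2.0, window=5))
```

Because only 80% of a window needs to fail, a reported region can extend slightly beyond the failing bases themselves, as happens in this toy example.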
As described earlier, each base scores one if it is covered by at least five perfect and uniquely mapped reads, and the FCD error is acceptable. If either of these tests fails, then the score is set to the number of tests that pass (considering all per-base metrics) scaled from zero to one, that is, a base scores zero if every test fails. The FCD error cutoff is chosen by sampling windows from the genome, then for each window the cutoff in FCD error needed to call that window as an error is calculated. In other words, for each window we find the value c such that 80% of the values in that window are greater than c. The proportion of failed windows as a function of cutoff value is plotted (Figure 2). The cutoff value for the FCD error is chosen to be the first value found, working from largest to smallest, such that the magnitude of the first and second derivatives (normalized to have a maximum magnitude of 1) of the plot are both at least 0.05.
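The scoring rule for a single base might be sketched as follows. The representation of the remaining per-base metrics as a list of pass/fail booleans (`other_tests`), the function name and the treatment of a missing FCD error as a failure are assumptions for illustration; the cutoff selection procedure is not modelled here.

```python
def base_score(perfect_depth, fcd_error, fcd_cutoff, other_tests, min_depth=5):
    """Score one base between zero and one.

    The base scores 1.0 when both priority tests pass. Otherwise the score
    is the fraction of all tests (priority and other) that pass.
    """
    depth_ok = perfect_depth >= min_depth
    fcd_ok = fcd_error is not None and fcd_error <= fcd_cutoff
    if depth_ok and fcd_ok:
        return 1.0
    tests = [depth_ok, fcd_ok] + list(other_tests)
    return sum(tests) / len(tests)


print(base_score(10, 0.1, 0.5, [False, False]))  # both priority tests pass
print(base_score(2, 0.1, 0.5, [True, True]))     # depth test fails
print(base_score(0, 0.9, 0.5, [False, False]))   # every test fails
```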
REAPR reports assembly errors and warnings in a GFF file, compatible with most genome viewers such as Artemis [25]. Regions with a high FCD error or low fragment coverage are reported as an error, whereas regions that fail any other tests are output as warnings for manual inspection. A summary spreadsheet is produced containing error counts, broken down into each type of error, for each contig and for the whole assembly. REAPR also produces a new assembly based on the error calls by breaking the genome wherever an error is called over a gap. Error regions within contigs are replaced with Ns, enabling them to be accurately reassembled locally by a gap closing tool [26, 27]. A second run of REAPR can be performed after gap closing to verify any new sequence added to the assembly. REAPR also generates plot files, compatible with Artemis, of all the statistics examined at each base for easy visualisation (see Additional file 1, Figure S7 for an example).
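The construction of the broken assembly can be sketched on a single scaffold. The sketch assumes error calls arrive as (start, end, kind) tuples in 0-based half-open coordinates with kind either "gap" or "contig", that a gap error region is dropped entirely, and that flanking Ns are trimmed from the resulting pieces; these conventions and the function name are illustrative assumptions rather than REAPR's actual behaviour.

```python
def apply_error_calls(sequence, errors):
    """Break a scaffold at gap errors and mask contig errors with Ns.

    Returns the list of resulting sequences, with leading and trailing
    Ns trimmed and empty pieces discarded.
    """
    seq = list(sequence)
    cuts = []
    for start, end, kind in errors:
        if kind == "contig":
            # Mask the erroneous block so it can be reassembled locally.
            seq[start:end] = "N" * (end - start)
        elif kind == "gap":
            cuts.append((start, end))
        else:
            raise ValueError("unknown error kind: " + kind)
    pieces = []
    previous = 0
    for start, end in sorted(cuts):
        pieces.append("".join(seq[previous:start]))
        previous = end
    pieces.append("".join(seq[previous:]))
    trimmed = [piece.strip("N") for piece in pieces]
    return [piece for piece in trimmed if piece]


scaffold = "ACGTACGTNNNNGGGGCCCC"
calls = [(2, 4, "contig"), (8, 12, "gap")]
print(apply_error_calls(scaffold, calls))
```

The lengths of the resulting pieces are what a corrected N50 would be computed from.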
De novo assemblies
The de novo assemblies of S. aureus and P. falciparum were produced using similar methods (see Additional file 1 for full details). Short insert Illumina reads were assembled using Velvet [28] version 1.2.03. These assemblies were scaffolded iteratively with SSPACE [29] version 2 using the short insert reads, followed by further rounds of scaffolding with larger insert reads, where available.
Manual comparisons between the de novo assemblies and reference genomes of S. aureus and P. falciparum were performed using ACT [20]. BLAST hits between the sequences were generated for viewing in ACT using blastall version 2.2.15 with the settings -p blastn -W 25 -F T -m 8 -e 1e-20.
When counting scaffolding error calls in S. aureus, the Velvet assembly was found to contain three problematic regions, with many gaps and errors due to repetitive sequences. Each of these regions was counted as one scaffolding error for the purpose of calculating REAPR's performance at error calling.
The read sets used for P. falciparum assemblies were Illumina 500bp insert, Illumina 3 kb insert and 454 8 kb insert reads. The short insert Illumina reads were used to generate perfect and uniquely mapped read depth, and also to call collapsed repeats. All other errors were identified using the 454 reads.
Perfectly mapped and unique read depth was generated for the C. elegans genome (WS228) using three Illumina lanes combined, and the larger insert size dataset comprised four combined Illumina lanes. Prior to mapping the latter reads, inner adaptor sequences were removed using in-house scripts based on SSAHA2 [30], retaining read pairs where each mate of the pair had a length of at least 35 bp. PCR primers were designed to amplify the top 20 FCD error regions using AcePrimer 1.3 [31].
High coverage Illumina data [32] were used to analyse the human and mouse reference genomes. For each organism, the dataset comprised short insert data and more than one 2-3 kb insert 'jumping' library. The short insert data were used to compute the perfect and uniquely mapped read depth and the 2-3 kb libraries were combined to obtain enough coverage for analysis with REAPR.
REAPR is open source and runs under Linux, with modest run time and memory requirements (Additional file 1, Table S7). It is written in C++ and Perl, relying on existing open source tools [23, 33, 34] and the BamTools C++ API [35]. A virtual machine is provided to enable Windows and Mac users to run REAPR.
The primary data for Brugia pahangi are available at the Short Reads Archive (SRA) under accession codes ERR070030 and ERR068352.
Other publicly available datasets used in this manuscript can be found in SRA under the accession codes: ERR142616 and SRR022868 (S. aureus); ERR034295, ERR163027-9 and ERR102953-4 (P. falciparum); ERR068453-6 and ERR103053-5 (C. elegans); SRR0676 (M. musculus); and SRR067577-9 and SRR0677 (H. sapiens).
FCD: Fragment Coverage Distribution
NGS: Next Generation Sequencing
REAPR: Recognition of Errors in Assemblies using Paired Reads.
We acknowledge Bernardo Foth, Adam Reid and Isheng J Tsai for proofreading the manuscript, and J. Tsai for the Brugia pahangi example. Martin Hunt and Thomas Otto were supported by the European Union 7th framework EVIMalaR and Mandy Sanders and Matthew Berriman by the Wellcome Trust (grant number: 098051). Taisei Kikuchi was supported by JSPS KAKENHI (grant number: 24780044). Chris Newbold was supported by the Wellcome Trust (grant number: 082130/Z/07/Z).
1. Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, et al: Genomics. Genome project standards in a new era of sequencing. Science. 2009, 326: 236-237. 10.1126/science.1180614.
2. Pagani I, Liolios K, Jansson J, Chen IM, Smirnova T, Nosrat B, Markowitz VM, Kyrpides NC: The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012, 40: D571-579. 10.1093/nar/gkr1100.
3. Mak HC: Genome interpretation and assembly-recent progress and next steps. Nat Biotechnol. 2012, 30: 1081-1083. 10.1038/nbt.2425.
4. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
5. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, et al: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419: 498-511. 10.1038/nature01097.
6. Sequence assembly. [http://en.wikipedia.org/wiki/Sequence_assembly]
7. Treangen TJ, Salzberg SL: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2011, 13: 36-46.
8. Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Methods. 2011, 8: 61-65. 10.1038/nmeth.1527.
9. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2011, 22: 1196-
10. Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, et al: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011, 21: 2224-2241. 10.1101/gr.126599.111.
11. Phillippy AM, Schatz MC, Pop M: Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008, 9: R55. 10.1186/gb-2008-9-3-r55.
12. Rahman A, Pachter L: CGAL: computing genome assembly likelihoods. Genome Biol. 2013, 14: R8. 10.1186/gb-2013-14-1-r8.
13. Clark SC, Egan R, Frazier PI, Wang Z: ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics. 2013, 29: 435-443. 10.1093/bioinformatics/bts723.
14. Vezzi F, Narzisi G, Mishra B: Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One. 2012, 7: e52210. 10.1371/journal.pone.0052210.
15. Narzisi G, Mishra B: Comparing de novo genome assembly: the long and short of it. PLoS One. 2011, 6: e19175. 10.1371/journal.pone.0019175.
16. Holden MT, Lindsay JA, Corton C, Quail MA, Cockfield JD, Pathak S, Batra R, Parkhill J, Bentley SD, Edgeworth JD: Genome sequence of a recently emerged, highly transmissible, multi-antibiotic- and antiseptic-resistant variant of methicillin-resistant Staphylococcus aureus, sequence type 239 (TW). J Bacteriol. 2010, 192: 888-892. 10.1128/JB.01255-09.
17. Riley MC, Kirkup BC, Johnson JD, Lesho EP, Ockenhouse CF: Rapid whole genome optical mapping of Plasmodium falciparum. Malar J. 2011, 10: 252. 10.1186/1475-2875-10-252.
18. Kidgell C, Volkman SK, Daily J, Borevitz JO, Plouffe D, Zhou Y, Johnson JR, Le Roch K, Sarr O, Ndir O, Mboup S, Batalov S, Wirth DF, Winzeler EA: A systematic map of genetic variation in Plasmodium falciparum. PLoS Pathog. 2006, 2: e57. 10.1371/journal.ppat.0020057.
19. Kraemer SM, Kyes SA, Aggarwal G, Springer AL, Nelson SO, Christodoulou Z, Smith LM, Wang W, Levin E, Newbold CI, Myler PJ, Smith JD: Patterns of gene recombination shape var gene repertoires in Plasmodium falciparum: comparisons of geographically diverse isolates. BMC Genomics. 2007, 8: 45. 10.1186/1471-2164-8-45.
20. Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J: ACT: the Artemis Comparison Tool. Bioinformatics. 2005, 21: 3422-3423. 10.1093/bioinformatics/bti553.
21. SMALT. [http://www.sanger.ac.uk/resources/software/smalt/]
22. Picard. [http://picard.sourceforge.net/]
23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
24. Manske HM, Kwiatkowski DP: SNP-o-matic. Bioinformatics. 2009, 25: 2434-2435. 10.1093/bioinformatics/btp403.
25. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA: Artemis: An integrated platform for visualisation and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012, 28: 464-469. 10.1093/bioinformatics/btr703.
26. Boetzer M, Pirovano W: Toward almost closed genomes with GapFiller. Genome Biol. 2012, 13: R56. 10.1186/gb-2012-13-6-r56.
27. Tsai IJ, Otto TD, Berriman M: Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 2010, 11: R41. 10.1186/gb-2010-11-4-r41.
28. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.
29. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27: 578-579. 10.1093/bioinformatics/btq683.
30. Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11: 1725-1729. 10.1101/gr.194201.
31. McKay SJ, Jones SJ: AcePrimer: automation of PCR primer design based on gene structure. Bioinformatics. 2002, 18: 1538-1539. 10.1093/bioinformatics/18.11.1538.
32. Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA. 2011, 108: 1513-1518. 10.1073/pnas.1017351108.
33. Li H: Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011, 27: 718-719. 10.1093/bioinformatics/btq671.
34. R Development Core Team: R: A language and environment for statistical computing. 2010, Vienna: R Foundation for Statistical Computing
35. Barnett DW, Garrison EK, Quinlan AR, Stromberg MP, Marth GT: BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011, 27: 1691-1692. 10.1093/bioinformatics/btr174.
36. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res. 2009, 19: 1639-1645. 10.1101/gr.092759.109.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.