ALLMAPS: robust scaffold ordering based on multiple maps
© Tang et al.; licensee BioMed Central. 2015
Received: 15 June 2014
Accepted: 15 December 2014
Published: 13 January 2015
The ordering and orientation of genomic scaffolds to reconstruct chromosomes is an essential step during de novo genome assembly. Because this process utilizes various mapping techniques that each provides an independent line of evidence, a combination of multiple maps can improve the accuracy of the resulting chromosomal assemblies. We present ALLMAPS, a method capable of computing a scaffold ordering that maximizes colinearity across a collection of maps. ALLMAPS is robust against common mapping errors, and generates sequences that are maximally concordant with the input maps. ALLMAPS is a useful tool in building high-quality genome assemblies. ALLMAPS is available at: https://github.com/tanghaibao/jcvi/wiki/ALLMAPS.
A hierarchical approach is typically adopted for the assembly of large eukaryotic genomes - starting with identifying overlapping reads to build contigs, then adding paired reads to build scaffolds, and finally ordering and orientating scaffolds together to assemble chromosomes using various sources of long distance information [1,2]. During this hierarchical process, larger and larger sequence chunks are assembled and finally ‘anchored’ onto chromosomal-sized pieces. The reconstructed chromosomal sequences are often referred to as ‘pseudo-molecules’ or ‘pseudo-chromosomes’ [3,4]. The prefix ‘pseudo’ implies that the assemblies may still contain uncertainties, and only represent a single specific hypothesis that needs to be evaluated in view of all available evidence .
One of the major steps in producing high quality genome assemblies is to use a variety of mapping information, including genetic maps, physical maps, cytological maps, optical maps, or synteny with related taxa to reconstruct the most likely chromosomal assemblies [4,6-10]. Many specialized tools have been developed over the years to assist with the various steps of the hierarchical genome assembly pipeline, including contigging , scaffolding [1,2], optical map alignment [9,12], and synteny-guided assembly [4,13]. In comparison, tools for anchoring of scaffolds based on more than one genetic map remain under-developed. Our method ALLMAPS, fills this algorithmic gap by optimizing agreement among multiple maps to order and orient scaffolds. While initially designed to use genetic maps to guide the chromosomal anchoring process and produce a close-to-optimal scaffold configuration, ALLMAPS can utilize a variety of techniques for generating physical and comparative maps of chromosomes.
A key feature of ALLMAPS is its built-in ability to handle multiple maps in a unified framework. Organisms that have been subjected to a moderate amount of research often have several genetic maps available, likely constructed by different labs at different points in time. These genetic maps are often derived from different mapping populations using genetically diverse parental lines, making consensus mapping across multiple maps difficult or sometimes infeasible . Maps constructed from F1 population derived from two heterozygous individuals typically produce two genetic maps, one for each gender, based on a backcross model [15,16]. Maps constructed using different software, such as R/QTL , MSTMAP , and JOINMAP , often differ in their final ordering of genetic markers. Recent Genotyping-by-Sequencing (GBS) technology has been used to generate high-density genetic maps through the use of multiplexing and high throughput sequencing . There is a pressing need to incorporate all available mapping information during the construction of chromosomal assemblies for organisms of interest, irrespective of differences due to the population, employed markers, or mapping software.
There are several benefits of using multiple genetic maps for scaffold anchoring. Genetic maps can vary in terms of recombination frequency, segregation distortion, presence-absence variation (PAVs), and chromosome regions with few mapped polymorphic sites. If such differences among maps are complementary, it may be possible to anchor scaffolds on one map that cannot be anchored on another. In addition, combining evidence from multiple maps to create a ‘meta-order’ can address potential weaknesses intrinsic in any single map. The assimilation of different maps means that many legacy maps (even those created by different labs) can be utilized to support the final scaffold order of a common reference genome assembly. In summary, multiple independent maps can provide more accurate scaffold orientation and ordering information, allowing a greater portion of the scaffolds to be anchored than would be possible with any single map alone.
The use of a single map to guide scaffold ordering is a computationally trivial problem. The scaffolds could simply be sorted based on the average map locations of the markers on each scaffold and there would be little ambiguity in the reconstructed order. In contrast, the goal of ALLMAPS is to combine information from multiple genetic maps to compute a common scaffold order. This is a more difficult challenge. Multiple maps may suggest incompatible scaffold orders and orientations, or ‘conflicts’. Computationally, each map implies a ‘partial’ scaffold ordering, which can be modeled as a directed acyclic graph (DAG), with edges representing the relative order between scaffolds. In the absence of conflict, the order of physical scaffolds can be solved by merging multiple DAGs followed by a topological sort. When conflicts exist between various maps, the resulting merged graph will contain conflicts, which must be resolved before direct sorting.
To build a consensus order, clear and objective rules for conflict resolution are necessary. Conflict resolution among input maps is conceptually similar to the problem of building consensus maps , but with important differences. In the case of building a consensus genetic map, a common set of markers must be present across most of the genetic maps in consideration. In general, we cannot expect to build a consensus map first and then use the consensus map to guide the scaffold anchoring. Even in the case when a consensus map is possible, too much information may be lost during the process of building consensus map, which itself is a difficult problem to solve . Further, in some applications especially in the case of maps generated via GBS , pairs of maps may share few or no markers but each genetic map can contain markers that are shared with the physical scaffolds.
Many studies making use of multiple maps employ ad hoc rules to resolve conflicts in the scaffold ordering among maps . While some of these rules may be effective in a particular study, they are inherently subjective and can become difficult to extrapolate to different genome datasets and replicate in future studies. In other cases, scaffold ordering may require arbitrary human decisions in regions of conflicting evidence [3,21]. As an increasing number of genome projects have access to multiple mapping data resources [3,8,10,21,22], a method that could accurately combine several lines of evidence to build high quality chromosomal assemblies has become essential.
Here, we implement a novel method, ALLMAPS, to address the current lack of computational tools for performing objective scaffold ordering based on colinearity of multiple maps. Colinearity, defined as the arrangement of one sequence in the same linear order as another sequence, is one of the most important criteria in evaluating map concordance [3,8,16] and evolutionary relatedness [4,23].
We highlight several salient features of ALLMAPS. First, we have formulated a clear, computable objective which is to maximize the sum of colinearity to multiple input maps, leading to better reproducibility in the anchoring process. Second, we allow variable weights in input maps, leading to better control in conflict resolution between different maps. Finally, we show that ALLMAPS can naturally be extended to incorporate other mapping evidence, including optical map and cross-species synteny, requiring minimal effort of data transformation. ALLMAPS is an elegant tool that promises to expedite genome assembly and facilitate the integration of various mapping evidence during the final stage of genome assembly.
Results and discussion
Computational complexity of the problem
The problem of ordering and orientation of genomic scaffolds is well known to be NP-hard [2,24]. A genomic map provides information that implies the relative placement of scaffolds. When either the genomic maps or the scaffold assemblies contain errors, or there are conflicts between the multiple input maps, the problem of finding the optimal scaffold ordering and orientations that satisfy the most constraints becomes intractable [2,24]. Studies on scaffolding based on read pairs largely rely on heuristics [1,2,24]. In the ALLMAPS implementation, we chose to use Genetic Algorithm (GA) instead of some other heuristics such as local search, hill climbing, and greedy strategy to avoid getting stuck in local optima [25,26]. Other related problems such as consensus mapping has adopted similar evolutionary strategy .
To assess the accuracy of ALLMAPS, we simulated a set of scaffolds with random orientations along the whole genome of Medicago truncatula, as the ‘truth’ to be evaluated against. The published Medicago genome has eight chromosomes, with 366 Mb in total length. Each of the eight chromosomes represents a typical complex eukaryotic chromosome. In order to have a length profile similar to real data, the scaffold length distribution of these simulated scaffolds was modeled from a preliminary assembly used in Mt4.0 (20,591 sequences in total, largest 346 Kb, scaffold N50 = 46 Kb) . A total of 8,000 genetic markers were generated from the Medicago genome following a uniform distribution and two-thirds of the markers were assigned to each genetic map following a random distribution along the genome. The marker density of each genetic map is 13.7 markers per Mb. To evaluate the accuracy of our program, we used the ratio between the longest monotonic subsequence (LMS) and the total number of markers as an indication of colinearity between the ALLMAPS answer and ‘truth’. This ratio varies between 0 and 1 - an accuracy of 1 means all markers are in perfect colinearity between the scaffold order and the input maps.
Next, we investigated the relationship between the accuracy and the numbers of genetic maps used for scaffolding. In this case, we introduced moderate level of errors (P inv = 0.2 and P trans = 0.2) into one genetic map (while not introducing errors into the other maps) and allowed ALLMAPS to take from one to eight genetic maps as input. As expected, ALLMAPS’s performance was improved as the number of available genetic maps increased (Figure 1C). The trend shows that big improvement of accuracy between one and two maps, and between two and three maps or more. The curve flattens when there are three input maps, suggesting that it is beneficial to have at least three maps for error correction. There is still noticeable improvement using more than three maps in this simulated study but the accuracy quickly approaches 1 when more maps are used as input. For a much stronger test, we introduced noise to all input maps at 20% errors (P inv = 0.2 and P trans = 0.2) and allowed ALLMAPS to take from one to eight genetic maps. When all input maps contained errors, ALLMAPS require more maps to be able to deduce the correct scaffold ordering, with accuracy approaching 0.9 only when eight maps were used (Figure 1D).
Application in construction of yellow catfish chromosomal assemblies
While simulated data revealed some basic properties of ALLMAPS, we applied ALLMAPS to a real-world genome project: yellow catfish (see Data availability). Genomic scaffolds were generated as part of the ongoing project to sequence the yellow catfish genome. We generated 100 bp paired end and mate pair reads (SRA accession: SRP050322) on an Illumina HiSeq 2000, which were assembled into scaffolds using ALLPATHS-LG . The yellow catfish project also generated genetic maps based on the progeny of a bi-parental cross between two individual heterozygous yellow catfish. Our goal was to use ALLMAPS to anchor scaffolds from a draft genome assembly of yellow catfish into pseudomolecules using these genetic maps. The scaffold assembly of yellow catfish consists of 9,224 scaffolds (N50 = 1 Mb) that comprise 718 Mb. The 161 scaffolds with a length of greater than 1 Mb (‘N50 scaffolds’) are our main focus in the anchoring process. SNPs derived from tGBS sequencing  were categorized into two subgroups basing on their segregation ratios in the population and parental genotypes. Based on a back-cross model , two genetic maps were constructed from each of the subgroups using R/QTL , that we called BCMale and BCFemale. Using the same subgroups of markers, but running a different mapping software (JOINMAP ), we obtained two additional maps that were called JMMale and JMFemale. We used all four maps (BCMale, BCFemale, JMMale, JMFemale) to order and orient scaffolds into pseudomolecules. We will show that creating two sets of maps with different software from the same data could improve the assembly.
Summary statistics for each of the four component maps (BCFemale, BCMale, JMFemale, JMMale, with equal weights) and final consensus anchoring (‘Anchored’) in the yellow catfish study
Markers per Mb
Scaffolds with 1 marker
Scaffolds with 2 markers
Scaffolds with 3 markers
Scaffolds with ≥4 markers
(Percent of genome)
Robust integration of multiple maps
When equal weights are given for each of the input maps, the final order is a majority-rule consensus among all input maps. However, users can also place a weight on each map to reflect their preference or confidence based on independent evaluations of accuracy. A useful usage pattern then emerges: the users can modify weights for the input maps, examine the summary report and diagnostic plots, and then iterate with different sets of weights. This is a supervised usage of ALLMAPS, as opposed to the default behavior of treating each map with the same weight (weight = 1). The map weights affect important aspects of the ALLMAPS algorithm, particularly on how the conflicts are resolved between maps. The weights affect to which chromosome a scaffold may anchor, as well as the final order and orientations between linked scaffolds. Naturally, the final scaffold configuration will be biased towards the map with the highest assigned weight.
Extension to other types of genomic maps
To test the performance of ALLMAPS on utilizing optical mapping data, we used the publicly available budgerigar dataset, which was also used in the Assemblathon 2 competition [31,32]. The budgerigar v6.3 scaffolds were aligned to the optical maps, based on which the mega-scaffolds were constructed and published [31,32]. Among a total of 49 suggested joins, ALLMAPS was able to recover 44 scaffold links (90%) with correct ordering and orientations. The remaining five links were manually inspected and were found not supported by optical mapping data (that is, in different optical map contigs). We suggest that these five discrepancies between ALLMAPS result and the published assembly might be due to other linking evidences such as read pairs that were not accessible by ALLMAPS.
To illustrate the full benefits of using multiple mapping technologies, we tested ALLMAPS on data derived from the genome of Medicago truncatula, where different types of genomic maps are publicly available (see Data availability). The Medicago genome project has developed a suite of maps that were incorporated into the Mt4.0 genome assembly . In particular, an optical map is available (Genbank nucleotide database accessions MAP_000013 to MAP_000020). Optical mapping is a technique to build high-resolution restriction endonuclease maps and has been widely used in reconstructing chromosomal assemblies [8-10]. Here, we treated matching restriction fragments between the optical map and in silico restriction fragment from the scaffolds as ‘markers’ with each marker having two coordinates for its position on a scaffold and on the optical map (Figure 3A). The second map is a genetic map constructed from a double-haploid population. For the third map, we use a comparative map between chickpea and Medicago. Chickpea has a high quality genome assembly and can be used to anchor Medicago scaffolds due to their evolutionary relatedness [8,33]. Synteny evidence can be utilized by treating syntenic gene pairs between the two species as ‘markers’, and transforming the data into the appropriate form for ALLMAPS (Figure 3B).
Different weights were assigned to each map in this integration, reflecting our relative confidence to each of the map: weight of 3 for the optical map, 2 for the genetic map, and 1 for the chickpea comparative map. In particular, the comparative synteny evidence should always be used with extra caution since the disruption of colinearity can be due to real evolutionary events such as chromosomal rearrangements, fusions, fissions, and translocations. Assigning the lowest weight to the comparative map ensures that it would never be considered when in conflict with the other two maps.
Summary statistics for three component maps and final consensus anchoring (‘Anchored’) in the Medicago study
OpticalMap (w = 3)
GeneticMap (w = 2)
Chickpea (w = 1)
Markers per Mb
Scaffolds with 1 marker
Scaffolds with 2 markers
Scaffolds with 3 markers
Scaffolds with ≥=4 markers
(Percent of genome)
Different mapping techniques tend to show different marker density characteristics and therefore likely complement each other in different regions along the chromosomes. Some genome projects that have had access to a multitude of maps were able to identify major discrepancies after checking two or more lines of evidence [10,21,33]. Our ALLMAPS framework is applicable to all coordinate-based maps or maps that can be converted into coordinate-based through transformation, therefore substantially reducing the overhead for finding the correct assembly tools to handle the plethora of mapping evidence. Using ALLMAPS to consolidate different maps substantially increases the completeness as well as the accuracy of the scaffold anchoring, generating very high quality draft assemblies.
When there is an unfortunate shortage of available genomic maps, for example, in ‘orphan’ species where there is little research investment in the past, we can still create consensus chromosomal assemblies based on comparative maps against multiple, closely-related genomes as a collection of ‘references’ . Each related genome may be assigned a weight, reflecting their evolutionary distance to the organism of interest. When only synteny data from multiple species are used, the assembled sequences by ALLMAPS might resemble an ‘ancestral’ chromosome arrangement, an idea exploited during the reconstruction of the Black Death agent Yersinia genome . Most existing tools for comparative assembly developed to date can only exploit a single ‘reference’ to assemble against [4,13]. Through the creative use of ALLMAPS, we can greatly expand the types of evidences that can be applied to a genome assembly even in situations where mapping evidence is scarce.
Factors affecting the performance of ALLMAPS
Recently, dense genetic maps can be generated with inexpensive genotyping technology such as GBS and RAD, and most published NGS genomes projects have already adopted these high-density genetic mapping protocols [15,16,20]. Optical maps often contain more markers because markers are derived from restriction fragments, which are often abundant in a large genome (for example, six-base cutter cuts every 4 Kb, on average). The density of markers in a comparative map is determined by the number of genes in conserved synteny blocks and divergence between the genomes in comparison .
Errors during the genome assembly and map making protocols could potentially propagate into the ALLMAPS assembly. The starting quality of the genome assembly based on read overlaps and read pairing is therefore crucial to the performance of ALLMAPS. A highly fragmented assembly could produce sub-optimal or erroneous results. In various mapping protocols, there are technological problems like simple genotyping errors that could lead to bad maps with missing or translocated segments. ALLMAPS warns about the chimeric scaffolds with mappings to multiple linkage groups, but relies on the user to provide relatively clean maps and scaffold assemblies to achieve the best results.
Specific genome structural features might cause additional problems during assembly, including chromosomal inversion, translocation, and segmental duplication. Map data that do not reflect genomic arrangement in the reference individual would cause incongruities among the input maps (for example, synteny data from species with rearrangements, or genetic maps from individuals with structural changes relative to the mapping parents). The errors in the input maps could potentially have a common biological cause. For example, segmental duplication disrupts genome assemblies reconstructed from short reads and might similarly affect various types of genomic maps . If the segments are divergent enough to be properly assembled, identifying markers and map based ordering would work as well for the duplicated segments as for other sequences. However, if the duplicate regions are collapsed into single scaffolds by the genome assemblers, ALLMAPS would not be able to separate these misassembled regions.
Despite various technical and biological complexities associated with making accurate assemblies and maps, ALLMAPS is designed to incorporate different types of map data including genetic, optical, and synteny data which are expected to have very different unlinked error profiles. For example, while the repetitive sequences or copy number variations (CNVs) might affect the optical map (generating similar restriction fragment patterns) and to a lesser degree also affect the synteny map, the genetic maps are less likely affected by the errors derived from repeats or CNVs. Repetitive sequences and CNVs produce distinctive segregation ratios and would be removed prior to genetic map construction. Conversely, genetic maps do not provide coverage for the non-recombinogenic regions like centromeres and sex chromosomes. This weakness of genetic maps could otherwise be remedied by incorporating other types of maps that are not based on genetic recombination.
We show that ALLMAPS is capable of integrating several lines of mapping evidences to guide the assembly of genomic scaffolds. A key feature of ALLMAPS is its ability to integrate information from multiple maps on the basis of a well-defined objective function to maximize the colinearity score. Each map can be assigned a weight, allowing flexible tuning based on users’ confidence on each input map. ALLMAPS identifies the consensus from several maps, resolves conflicts based on user assigned weights and consolidates these results into a highly consistent scaffold ordering given the available mapping evidence. ALLMAPS can incorporate other types of maps including physical maps, optical maps and comparative maps as well, thus offering a one-stop shop for robust integration of scaffold linkage evidence from a variety of popular mapping technologies, resulting in higher coverage, more accurate, and more replicable chromosome-level genome assemblies.
ALLMAPS objective and algorithm overview
The colinearity between the array of physical locations on the DNA sequence and the array of map locations can be assessed, given a specific configuration of all scaffolds. Assuming the physical locations are already sorted, we can estimate the colinearity via the length of its longest increasing subsequence within the map locations. An increasing subsequence refers to a subsequence in which the elements are in sorted order, from lowest to highest. In this context, the length of a subsequence is equal to the number of markers. Because the polarity of the linkage group was arbitrarily determined during map construction, we may look for either the longest increasing subsequence (LIS) or longest decreasing subsequence (LDS). Without loss of generality, we define the ‘longest monotonic subsequence’ (LMS) as the larger of LIS and LDS to indicate the degree of colinearity, irrespective of the polarity of the map (Figure 6A).
S - Set of scaffolds
G - Set of maps
w j , where j ϵ G - Weight for the j-th map
M ij , where i ϵ S, j ϵ G - Ordered set of markers on i-th scaffold and j-th map. Each marker carries two positions - physical position on the scaffold and genetic position on the map. Within the set of M ij , markers are ordered according to increasing base pair positions on the scaffold. With a leading ‘minus’ sign, - M ij indicates the same set of markers ordered according to decreasing base pair positions on the scaffold
LMS (M ij ) - Longest monotonic sequence of genetic positions among the ordered markers in M ij . We note that this directly measures the colinearity between the physical positions and the genetic positions.
Clustering of homologous linkage groups
Prior to computing the scaffold configuration in the two phases, we divide the whole problem into several sub-problems, with each sub-problem representing a single chromosome that can be solved independently. We first define the ‘pivot map’ as the map with the largest weight, or the map that occurs first in the user input in case of ties between weights. The pivot map determines the number of chromosomes to appear in the assembly. Consequently, the pivot map should ideally contain the same number of linkage groups as the number of chromosomes in the target organism. We assign ‘homologous’ linkage groups from all input maps into separate clusters, with each cluster seeded by a single linkage group from the pivot map. Each linkage group from a non-pivot map will be assigned to the cluster with which it shared the most number of scaffolds. Once the clustering is completed, each scaffold is assigned to a single best cluster based on the number of markers. To reduce ambiguity, chimeric scaffolds that map well to two or more linkage group clusters should preferably be split prior to the execution of ALLMAPS. To determine whether a scaffold is chimeric, ALLMAPS counts the sum of markers mapped to each linkage group cluster, weighted across all input maps. ALLMAPS alerts the user to the chimeric cases with equally good mappings to two or more linkage clusters and skips them during its computation. As a final step in the preprocessing step, we apply Iglewicz and Hoaglin’s outlier test to remove markers that appear to have erratic genetic positions . These markers are most likely due to genotyping errors or other artifacts when the map was constructed.
Phase 1: Generate an initial scaffold order and orientation
This step is the first phase in finding an optimal configuration of scaffolds. There are two aspects in this phase, finding an optimal ordering and finding the optimal orientations of scaffolds. We solve these two problems separately.
We note that the sign of o (a, b) determines the relative orientation between scaffold a and b. A positive score indicates that scaffold a and b are of the same orientation, a negative score indicates opposite orientation, and zero score indicates undetermined orientations. We store the relative scaffold orientations between all pairs of scaffolds in a square matrix M, with number of rows equal to the number of scaffolds. We then compute the eigenvector y corresponding to the largest eigenvalue of M, also known as the Perron-Frobenius eigenvalue. The signs of the components in y provide an approximate solution to flipping the scaffolds to achieve the highest consistency given all pairwise orientations.
With all pairwise scaffold distances calculated, we seek the ordering that yields the least sum of distances between adjacent scaffolds. This problem is then analogous to the famous ‘traveling salesman problem’ (TSP). We calculate the ordering using CONCORDE, currently considered to be the best TSP solver so far . CONCORDE has been applied to solve a variety of bioinformatics problem including radiation mapping  and prediction of protein functions . CONCORDE runs quickly within seconds in all real-world scaffold anchoring problems that we have tested thus far, making it an ideal solver for an approximate solution in Phase 1.
The initial ordering based on the minimization of the inter-scaffold distance across multiple maps is often close to the final solution that maximizes colinearity, suggesting that these two objectives are largely correlated. However, TSP can still generate sub-optimal solutions that require further tuning. For example, the maps may still be very noisy so that even the markers within a single scaffold may not be collinear, which are then likely to skew the distance calculation. However, since our goal in Phase 1 is to simply cut down the total running time by minimizing the search space for Phase 2, this does not have to yield the exact final solution.
Phase 2: Refine order and orientation using a genetic algorithm
The initial scaffold configuration calculated in Phase 1 is close to the final solution, but is sometimes sub-optimal in the L score. In Phase 2, we apply a Genetic Algorithm (GA) to further refine the order and orientation of the scaffolds. Indeed, we could skip Phase 1 and directly run GA to maximize the L score from the start, but formulating the problem as TSP to compute an initial solution speeds up the GA in Phase 2 so that ALLMAPS can quickly converge on a final solution.
A standard GA strategy operates on a population of ‘individuals’, where each individual represents one possible solution - or in our problem, one possible scaffold configuration. We apply ‘mutation’ and ‘crossover’ operations to introduce changes to the current pool of individuals, while each individual is evaluated with respect to their ‘fitness’ (the L score) after the change. Individuals with high fitness scores are preferentially retained (elitist selection) in each generation.
At the start of GA, all individuals are instantiated with the configuration we computed in Phase 1. We then apply two types of mutations at each generation: ‘inversion’ which randomly selects two points in each solution and reverses the order of the scaffolds in between; ‘insertion’ which randomly translocates a scaffold and inserts it next to another randomly selected scaffold. These two mutation operators represent both large-scale and small-scale changes. For crossover operator, we use the ‘Partially Mapped Crossover’ (PMX) function that was shown to speed up convergence . The overall GA scheme is configured with mutation and crossover probability of 0.2 and 0.7, respectively, which were selected to offer a rapid convergence out of a range of test values. The population size is set at 100, and is allowed to evolve until there is no change of best solution in the last 1,000 generations as convergence criteria. These options can be adjusted at runtime. Generations beyond 1,000 can further ensure that the solution converges to the optimal, but also lead to longer running time. To further boost performance, we used parallel processing during the fitness evaluation which is the most time-consuming step during GA .
After the order is changed in one round of GA, we iterate through the scaffolds and check if flipping of any scaffold would increase in the final score. The scaffold is then flipped if an improvement is possible. The order and orientation of scaffolds are two intertwined problems with the results of one affecting the other throughout the computation. If the orientations are changed, then the order among the scaffolds needs to be re-optimized. Similarly, if the order is changed, then the orientations need to be re-optimized. Therefore, several rounds of refinement are sometimes necessary before converging on a final solution. Each round is consisted of optimization of scaffold order followed by optimization of the orientations, until no further improvement in score can be made (Figure 7).
Genome release, summary statistics, and visualization
After computing the scaffold configuration, ALLMAPS proceeds to build the genome assembly. The scaffolds are concatenated together according to the computed order and orientations. Pseudomolecule sequences are constructed with 100Ns (configurable) padded between the scaffolds to represent inter-scaffold gaps. Three outputs - FASTA, AGP, and CHAIN files are the key outputs of the chromosomal assemblies. The FASTA file contains the nucleotides of the pseudomolecules and unanchored sequences. The AGP file is a standard file format to describe assembly of the pseudomolecules from the scaffolds, specifying both the order and orientations. The header of the AGP file contains metadata of the operation, tracking input files and parameters when running ALLMAPS. The CHAIN file, when used in conjunction with the UCSC ‘liftOver’ tool, allows easy conversion from scaffold coordinates to final chromosome-level coordinates . This can be very useful, for example, when gene features were initially predicted on the scaffolds but need to be transferred onto the chromosomes. A byproduct of the liftOver procedure is a consensus map constructed from all input maps. This consensus map is constructed via a common physical order along the reconstructed chromosomes and therefore different from other implementations of consensus mapping [14,42].
A summary report is provided to the user during the genome build, with important statistics such as the number of scaffolds anchored, the number of big (N50) scaffolds anchored, total number of markers included in the final assembly and total length of sequences. These summary statistics can be useful tools to compare the efficacy of ALLMAPS before and after anchoring. We offer two popular ways of visualizing alignments between the markers and reconstructed chromosomes. The first visualization is the ‘side-by-side’ (also known as ‘parallel coordinates’) alignments between chromosomes and the linkage groups, with connecting lines showing the location of the markers (Figure 2A). This type of plot is helpful to reveal conflicting markers as crossing lines, which is also used in CMAP . The second visualization is scatter plot, where the coordinates of the dots represent the physical locations and the map locations of the markers, for each input map (Figure 2B). The scatter plots are a good visualization for illustrating the monotonic trend as well as revealing breaks in colinearity. Through visual inspection of these plots, it is possible to quickly assess the concordance between the final ordering of scaffolds against each of the input maps used in the assembly.
Gap size estimation
We thank the Fujian provincial government for a Fujian ‘100 Talent Plan’ award to HT. EL is supported by the Gordon and Betty Moore Foundation grant number 3383. iPlant is supported by the National Science Foundation under grant numbers #DBI-0735191 and #DBI-1265383. We also thank Andy Edmonds and Andre Mercer for their assistance in the deployment of ALLMAPS on iPlant Discovery Environment.
- Hunt M, Newbold C, Berriman M, Otto TD. A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 2014;15:R42.PubMed CentralPubMedView ArticleGoogle Scholar
- Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome Res. 2004;14:149–59.PubMed CentralPubMedView ArticleGoogle Scholar
- Sharma SK, Bolser D, de Boer J, Sonderkaer M, Amoros W, Carboni MF, et al. Construction of reference chromosome-scale pseudomolecules for potato: integrating the potato genome with genetic and physical maps. G3. 2013;3:2031–47.PubMed CentralPubMedView ArticleGoogle Scholar
- Assefa S, Keane TM, Otto TD, Newbold C, Berriman M. ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009;25:1968–9.PubMed CentralPubMedView ArticleGoogle Scholar
- Howison M, Zapata F, Dunn CW. Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics. 2013;29:2959–63.PubMedView ArticleGoogle Scholar
- Ren Y, Zhao H, Kou Q, Jiang J, Guo S, Zhang H, et al. A high resolution genetic map anchoring scaffolds of the sequenced watermelon genome. PLoS One. 2012;7:e29453.PubMed CentralPubMedView ArticleGoogle Scholar
- Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics. 2009;25:2071–3.PubMed CentralPubMedView ArticleGoogle Scholar
- Tang H, Krishnakumar V, Bidwell S, Rosen B, Chan A, Zhou S, et al. An improved genome release (version Mt4.0) for the model legume Medicago truncatula. BMC Genomics. 2014;15:312.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, et al. A single molecule scaffold for the maize genome. PLoS Genet. 2009;5:e1000711.PubMed CentralPubMedView ArticleGoogle Scholar
- Chamala S, Chanderbali AS, Der JP, Lan T, Walts B, Albert VA, et al. Assembly and validation of the genome of the nonmodel basal angiosperm Amborella. Science. 2013;342:1516–7.PubMedView ArticleGoogle Scholar
- Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–27.PubMed CentralPubMedView ArticleGoogle Scholar
- Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008;24:1229–35.PubMed CentralPubMedView ArticleGoogle Scholar
- Husemann P, Stoye J. r2cat: synteny plots and comparative assembly. Bioinformatics. 2010;26:570–1.PubMed CentralPubMedView ArticleGoogle Scholar
- Ronin Y, Mester D, Minkov D, Belotserkovski R, Jackson BN, Schnable PS, et al. Two-phase analysis in consensus genetic mapping. G3. 2012;2:537–49.PubMed CentralPubMedView ArticleGoogle Scholar
- Kai W, Nomura K, Fujiwara A, Nakamura Y, Yasuike M, Ojima N, et al. A ddRAD-based genetic map and its integration with the genome assembly of Japanese eel (Anguilla japonica) provides insights into genome evolution after the teleost-specific genome duplication. BMC Genomics. 2014;15:233.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang N, Fang L, Xin H, Wang L, Li S. Construction of a high-density genetic map for grape using next generation restriction-site associated DNA sequencing. BMC Plant Biol. 2012;12:148.PubMed CentralPubMedView ArticleGoogle Scholar
- Arends D, Prins P, Jansen RC, Broman KW. R/qtl: high-throughput multiple QTL mapping. Bioinformatics. 2010;26:2990–2.PubMed CentralPubMedView ArticleGoogle Scholar
- Wu Y, Bhat PR, Close TJ, Lonardi S. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet. 2008;4:e1000212.PubMed CentralPubMedView ArticleGoogle Scholar
- VANO JW. Multipoint maximum likelihood mapping in a full-sib family of an outbreeding species. Genet Res. 2011;93:343–9.View ArticleGoogle Scholar
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6:e19379.PubMed CentralPubMedView ArticleGoogle Scholar
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457:551–6.PubMedView ArticleGoogle Scholar
- Zhang Q, Chen W, Sun L, Zhao F, Huang B, Yang W, et al. The genome of Prunus mume. Nat Commun. 2012;3:1318.PubMed CentralPubMedView ArticleGoogle Scholar
- Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. Synteny and collinearity in plant genomes. Science. 2008;320:486–8.PubMedView ArticleGoogle Scholar
- Nguyen N, Hickey G, Zerbino D, Raney B, Earl D, Armstrong J, et al. Building a pangenome reference for a population. In: Sharan R, editor. Research in computational molecular biology, vol. 8394. Berlin: Springer International Publishing; 2014. p. 207–21.View ArticleGoogle Scholar
- Prügel-Bennett A. When a genetic algorithm outperforms hill-climbing. Theor Comput Sci. 2004;320:135–53.View ArticleGoogle Scholar
- Mitchell M, Holland JH, Forrest S. When will a genetic algorithm outperform hill climbing? In: Cowan GTJD, Alspector J, editors. Advances in neural information processing systems 6. San Mateo, CA: Morgan Kaufmann; 1994. p. 51–8.Google Scholar
- Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011;108:1513–8.PubMed CentralPubMedView ArticleGoogle Scholar
- Genotyping by Next-Generation Sequencing. Patent Application WO-2013/106737-Al. [http://www.lens.org/images/patent/WO/2013106737/A1/WO_2013_106737_A1.pdf]
- Sakamoto T, Danzmann RG, Gharbi K, Howard P, Ozaki A, Khoo SK, et al. A microsatellite linkage map of rainbow trout (Oncorhynchus mykiss) characterized by large sex-specific differences in recombination rates. Genetics. 2000;155:1331–45.PubMed CentralPubMedGoogle Scholar
- ALLMAPS wiki page. [https://github.com/tanghaibao/jcvi/wiki/ALLMAPS]
- Howard JT, Koren S, Phillippy A, Zhou S, Schwartz D, Schatz M, et al. De novo high-coverage sequencing and annotated assemblies of the budgerigar genome. Gigascience. 2014;3:11.PubMed CentralPubMedView ArticleGoogle Scholar
- Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013;2:10.PubMed CentralPubMedView ArticleGoogle Scholar
- Young ND, Debelle F, Oldroyd GE, Geurts R, Cannon SB, Udvardi MK, et al. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature. 2011;480:520–4.PubMed CentralPubMedView ArticleGoogle Scholar
- Rajaraman A, Tannier E, Chauve C. FPSAC: fast phylogenetic scaffolding of ancient contigs. Bioinformatics. 2013;29:2987–94.PubMedView ArticleGoogle Scholar
- Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nat Methods. 2011;8:61–5.PubMed CentralPubMedView ArticleGoogle Scholar
- Iglewicz B, Hoaglin D. How to detect and handle outliers. Milwaukee, WI: ASQC Quality Press; 1993.Google Scholar
- Mulder SA, Wunsch 2nd DC. Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks. Neural Netw. 2003;16:827–32.PubMedView ArticleGoogle Scholar
- Agarwala R, Applegate DL, Maglott D, Schuler GD, Schaffer AA. A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 2000;10:350–64.PubMed CentralPubMedView ArticleGoogle Scholar
- Johnson O, Liu J. A traveling salesman approach for predicting protein functions. Source Code Biol Med. 2006;1:3.PubMed CentralPubMedView ArticleGoogle Scholar
- Fortin F, De Rainville F-M, Gardner MA, Parizeau M, Gagné C. DEAP: evolutionary algorithms made easy. J Mach Learn Res. 2012;13:2171–5.Google Scholar
- Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al. The UCSC genome browser database: update 2006. Nucleic Acids Res. 2006;34:D590–8.PubMed CentralPubMedView ArticleGoogle Scholar
- Mester DI, Ronin YI, Korostishevsky MA, Pikus VL, Glazman AE, Korol AB. Multilocus consensus genetic maps (MCGM): formulation, algorithms, and results. Comput Biol Chem. 2006;30:12–20.PubMedView ArticleGoogle Scholar
- Fang Z, Polacco M, Chen S, Schroeder S, Hancock D, Sanchez H, et al. cMap: the comparative genetic map viewer. Bioinformatics. 2003;19:416–7.PubMedView ArticleGoogle Scholar
- Rezvoy C, Charif D, Gueguen L, Marais GA. MareyMap: an R-based tool with graphical interface for estimating recombination rates. Bioinformatics. 2007;23:2188–9.PubMedView ArticleGoogle Scholar
- Voigt C, Moller S, Ibrahim SM, Serrano-Fernandez P. Non-linear conversion between genetic and physical chromosomal distances. Bioinformatics. 2004;20:1966–7.PubMedView ArticleGoogle Scholar
- ALLMAPS web-based interface in the iPlant Discovery Environment. [https://pods.iplantcollaborative.org/wiki/display/DEapps/ALLMAPS]
- Yellow catfish genome sequencing data on Sequence Read Archive. [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP050322]
- ALLMAPS supporting data: Yellow catfish genome assembly. [http://dx.doi.org/10.6084/m9.figshare.1057746]
- ALLMAPS supporting data: Medicago genome assembly. [http://dx.doi.org/10.6084/m9.figshare.1057745]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.