Fig. 1 | Genome Biology

Fig. 1

From: GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation


Features of get_pangenes.pl.

(A) Flowchart of the main tasks and deliverables of the script get_pangenes.pl: cutting cDNA and CDS sequences (top), calling collinear genes (middle, panels B and C) and clustering (bottom, panel D). By default, only cDNA and CDS sequences longer than 100 bp are considered. Whole genome alignments (WGA) can be computed with minimap2 (default) or GSAlign, and the input genomes can optionally be split into chromosomes or have their long geneless regions (> 1 Mbp) masked. The resulting gene clusters contain all isoforms and are post-processed to produce pangene and percentage of conserved sequences (POCS) matrices, as well as to estimate pan-, soft-core-, and core-genomes. GSAlign also produces average nucleotide identity (ANI) matrices. Several tasks can be fine-tuned by customizing an array of parameters, of which alignment coverage is perhaps the most important.

(B) WGA of genomes A and B produces BED-like files that are intersected with gene models from B. The intersected coordinates are then used to transform B gene models to the genomic space of A. Finally, A gene models that overlap the transformed B gene models on the same strand are defined as collinear genes.

(C) Feature overlap is computed from WGAs and gene coordinates from the source GFF files. When checking the overlap of A and B gene models, both must lie on the same strand. Overlaps can also be estimated between gene models annotated in one assembly and matched genomic segments from others.

(D) Making greedy clusters by merging pairs of collinear genes. This algorithm has one key parameter, the maximum distance (in genes) among sequences of the same species that go in a cluster (default = 5). Its effect is illustrated on the right side, where gene g34 is left unclustered because it has too many intervening genes.
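The strand-aware overlap check of panel C and the greedy merging of panel D can be illustrated with a minimal Python sketch. It is not the get_pangenes.pl implementation: the function names `overlap_fraction` and `greedy_clusters`, the dictionary layout of a gene model, the half-open coordinates, the choice of the shorter gene as the overlap denominator, and the representation of a gene as a `(species, position)` tuple are all assumptions made for illustration. Only the same-strand requirement and the maximum distance of 5 genes come from the caption.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter gene model covered by the overlap.

    Assumed layout: dicts with 'chrom', 'start', 'end' (half-open) and 'strand',
    both already expressed in the same genomic space.
    """
    # Strandedness is required: models on different strands never overlap.
    if a["chrom"] != b["chrom"] or a["strand"] != b["strand"]:
        return 0.0
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if shared <= 0:
        return 0.0
    shortest = min(a["end"] - a["start"], b["end"] - b["start"])
    return shared / shortest


def greedy_clusters(pairs, max_distance=5):
    """Greedily merge pairs of collinear genes into clusters.

    A gene is a (species, position) tuple, where position is its ordinal
    index along the genome (hypothetical representation). A gene joins an
    existing cluster only if every gene of the same species already in that
    cluster lies within max_distance genes of it.
    """
    clusters = []    # list of sets of genes
    member_of = {}   # gene -> index of its cluster
    seen = set()

    def fits(gene, cluster):
        species, pos = gene
        return all(abs(pos - p) <= max_distance
                   for s, p in cluster if s == species)

    for gene_a, gene_b in pairs:
        seen.update((gene_a, gene_b))
        idx_a, idx_b = member_of.get(gene_a), member_of.get(gene_b)
        if idx_a is None and idx_b is None:
            # Neither gene is clustered yet: the pair seeds a new cluster.
            member_of[gene_a] = member_of[gene_b] = len(clusters)
            clusters.append({gene_a, gene_b})
        elif idx_b is None and fits(gene_b, clusters[idx_a]):
            clusters[idx_a].add(gene_b)
            member_of[gene_b] = idx_a
        elif idx_a is None and fits(gene_a, clusters[idx_b]):
            clusters[idx_b].add(gene_a)
            member_of[gene_a] = idx_b
        # Pairs whose genes are both clustered are ignored in this sketch.

    unclustered = sorted(g for g in seen if g not in member_of)
    return clusters, unclustered


# Toy input: the third pair would add a second gene from species C that is
# 32 genes away from the one already clustered, so it is left out.
pairs = [(("A", 1), ("B", 1)), (("B", 1), ("C", 2)), (("A", 1), ("C", 34))]
clusters, unclustered = greedy_clusters(pairs)
```

In this toy run the gene at position 34 of species C plays the role of g34 in panel D: it has a collinear partner, but it sits too far from the species C gene already in the cluster to be merged with the default distance.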
