Fig. 3 | Genome Biology

From: GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species

Performance comparison between GBC and alternative methods (details in Additional file 1: Tables S1–S5). a Basic performance of different methods on the UKBB exome chr4 dataset. The “standardized performance ratio” was obtained by scaling GBC’s result to 1 and expressing all other results relative to GBC’s result. b Basic performance of different methods on the 1000GP3 and SG10K datasets. c Compression speed (upper region) and decompression speed (lower region) of GBC and alternative methods as sample size increases on simulated datasets. d Significant improvement of compression and decompression speed with multi-threading on the 1000GP3 dataset. e Retrieval of the genotypes of random variants for all subjects on the SG10K dataset. Only BCFtools, GBC, and Genozip provide an option for accessing genotypes of variants in multiple regions at a time; for GTC, PBWT, and BGT, the time cost was therefore estimated from the access times of individual sites. Genozip threw an exception when accessing 994,485 and 9,944,848 variants. f Retrieval of a range of continuous variants for all subjects on the Simulation 5000 K and SG10K-chr2 (the genotypes on chromosome 2 of the SG10K dataset) datasets. BGT, PBWT, GTC, and Genozip failed to compress the Simulation 5000 K dataset. g Retrieval of all variants for a specified subset of subjects on the SG10K-chr2 dataset. h Retrieval of all variants for a specified subset of subjects on the Simulation 500 K dataset. i Filtering of variants by alternative allele frequency on the SG10K-chr2 dataset. j Retrieval of continuous variants and random variants on the ordered and unordered SG10K datasets, respectively. k Comparison of LD-coefficient computation speed between GBC and other popular tools on the 1000GP3 and SG10K datasets. l GBC speeds up follow-up computation (calculating pairwise linkage disequilibrium coefficients as an example) through I/O optimization.
m Concatenation of the chromosome-separated files within each dataset. n Splitting of compressed archives by chromosome. o Merging of multiple compressed archives with non-overlapping subjects. p Retrieval of all genotypes for a specified subset of subjects and rebuilding of the compressed archives; we tested the time cost of fetching subject subsets of different sizes on the Simulation 500 K dataset. q Sorting of variants by coordinate; we used several disordered simulated datasets with 100,000 subjects to evaluate the time cost, and measured the range of the GBC-to-BCFtools speed ratio as a function of the degree of disorder of the datasets.
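To make the “standardized performance ratio” in panel a concrete, the sketch below shows one way such a normalization could be computed: each tool’s raw measurement is divided by GBC’s measurement, so GBC maps to exactly 1. The tool names other than GBC and all timing values here are hypothetical placeholders, not figures from the paper.

```python
# Hypothetical raw timings in seconds ("ToolX"/"ToolY" and the values
# are illustrative placeholders, not results from the paper).
raw_times = {"GBC": 12.0, "ToolX": 30.0, "ToolY": 18.0}

# Scale GBC's result to 1 and express every other result
# relative to GBC's result, as described for panel (a).
baseline = raw_times["GBC"]
standardized = {tool: t / baseline for tool, t in raw_times.items()}

print(standardized)  # GBC -> 1.0; other tools -> their ratio vs GBC
```

With this convention a ratio above 1 means the tool took longer (or used more resources) than GBC on the same task, which is what the figure's bars visualize.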