Table 1 Benchmark datasets

From: scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data

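| Data source | Dataset identity | Dataset type | Number of droplets | Doublet rate | Med. of sequencing depth | Med. of sparsity | Number of peaks | Number of cell types | Complexity (silhouette coefficient) | Valid read pairs | Avg. valid read pairs per droplet | Med. valid read pairs per droplet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MF [19] (GSE100033) | SimATAC-balance | Fully-synthetic | 4500 | 0.11 | 1172 | 0.9916 | 140,102 | 8 | 0.66 | - | - | - |
| MF [19] (GSE100033) | SimATAC-imbalance | Fully-synthetic | 2100 | 0.2 | 1291 | 0.9908 | 140,102 | 8 | 0.56 | - | - | - |
| HMC [6] (GSE162690) | High-loading | Real | 20,663 | 0.29 | 1102 | 0.9941 | 187,075 | 10 | 0.31 | 175,950,050 | 8515 | 7510 |
| HMC [6] (GSE162690) | Low-loading | Real | 12,793 | 0.21 | 1004 | 0.9940 | 166,360 | 10 | 0.42 | 77,050,576 | 6022 | 4641 |
| MF [19] (GSE100033) | Forebrain* | Semi-synthetic | 1298 | 0.2 | 1930 | 0.9915 | 226,759 | 5 | 0.23 | 40,227,305 | 30,992 | 25,004 |
| MCA [20] (GSE111586) | Bone marrow* | Semi-synthetic | 5244 | 0.2 | 2120 | 0.9917 | 254,413 | 18 | -0.22 | 173,189,105 | 33,026 | 20,842 |
| MCA [20] (GSE111586) | Lung* | Semi-synthetic | 6146 | 0.2 | 1406 | 0.9918 | 171,002 | 22 | -0.08 | 130,550,777 | 21,241 | 14,293 |
| MCA [20] (GSE111586) | Whole brain* | Semi-synthetic | 6494 | 0.2 | 2404 | 0.9873 | 189,933 | 21 | 0.01 | 255,248,909 | 39,305 | 21,679 |
| MCA [20] (GSE111586) | Cerebellum | Semi-synthetic | 2733 | 0.2 | 1465 | 0.9934 | 221,657 | 20 | 0.04 | 53,152,804 | 19,448 | 10,464 |
| MCA [20] (GSE111586) | Heart | Semi-synthetic | 9180 | 0.2 | 2029 | 0.9886 | 178,104 | 22 | -0.30 | 236,918,887 | 25,808 | 15,143 |
| MCA [20] (GSE111586) | Kidney | Semi-synthetic | 7717 | 0.2 | 2702 | 0.9858 | 190,690 | 26 | -0.24 | 215,602,830 | 27,938 | 19,557 |
| MCA [20] (GSE111586) | Prefrontal cortex | Semi-synthetic | 7150 | 0.2 | 4863 | 0.9765 | 206,546 | 22 | -0.19 | 307,767,321 | 43,044 | 27,204 |
| MCA [20] (GSE111586) | Spleen | Semi-synthetic | 4824 | 0.2 | 2439 | 0.9845 | 157,363 | 15 | -0.02 | 128,524,323 | 26,620 | 19,567 |
| Islets [16] (GSE165212) | Islet1 | Semi-synthetic | 5076 | 0.2 | 2494 | 0.9830 | 147,104 | 5 | 0.16 | 102,622,313 | 15,495 | 18,966 |
| Islets [16] (GSE165212) | Islet2 | Semi-synthetic | 5145 | 0.2 | 2778 | 0.9804 | 138,549 | 5 | 0.13 | 101,595,484 | 17,755 | 12,137 |
| 10xPBMC [21, 22] (10xGenomics) | PBMC | Semi-synthetic | 9732 | 0.2 | 3112 | 0.9667 | 93,446 | 17 | 0.25 | 316,811,190 | 32,553 | 32,535 |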

  1. An asterisk (*) indicates that the tissue has two sequencing replicates: one supplies the singlets and the other is used to generate doublets. The doublet rates of the semi-synthetic datasets are the simulated ratios relative to the number of singlets, not the actual doublet rates of the datasets. The fully-synthetic datasets were simulated directly from count matrices, so read statistics are not available for them.
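
The distinction the footnote draws between a ratio relative to singlets and an actual doublet rate can be illustrated with a small calculation. The sketch below assumes that "ratio based on the number of singlets" means doublets divided by singlets; the function names, the rounding, and the singlet count are hypothetical and are not taken from Table 1 or from the scIBD code.

```python
def simulated_doublet_count(n_singlets: int, ratio: float) -> int:
    """Doublets to simulate when the ratio is defined relative to singlets.

    Assumption: ratio = n_doublets / n_singlets, rounded to a whole droplet.
    """
    return round(n_singlets * ratio)


def doublet_fraction(n_singlets: int, n_doublets: int) -> float:
    """Fraction of all droplets (singlets plus doublets) that are doublets."""
    return n_doublets / (n_singlets + n_doublets)


n_singlets = 1000  # hypothetical count, not a value from Table 1
n_doublets = simulated_doublet_count(n_singlets, 0.2)

print(n_doublets)                                          # 200
print(round(doublet_fraction(n_singlets, n_doublets), 4))  # 0.1667
```

Under this reading, a listed value of 0.2 would correspond to roughly one doublet in every six droplets of the mixed dataset, which is lower than 0.2 when expressed as a share of all droplets.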