Table 18 Summary of data sets and class imbalance levels

From: Survey on deep learning with class imbalance

| Paper | Data sets | Data type | Class count | Data set size | Min class size | Max class size | \(\rho\) (Eq. 1) |
|---|---|---|---|---|---|---|---|
| [79] | CIFAR-10 | Image | 10 | 60,000 | 2340 | 3900 | 2.3 |
| [20] | WHOI-Plankton | Image | 103 | 3,400,000 | < 3500 | 2,300,000 | 657 |
| [21] | Public cameras | Image | 19 | 10,000 | 14 | 6986 | 499 |
| [18] | CIFAR-100 (1) | Image | 2 | 6000 | 150 | 3000 | 20 |
| | CIFAR-100 (2) | Image | 2 | 1200 | 30 | 600 | 20 |
| | CIFAR-100 (3) | Image | 2 | 1200 | 30 | 600 | 20 |
| | 20 News Group (1) | Text | 2 | 1200 | 30 | 600 | 20 |
| | 20 News Group (2) | Text | 2 | 1200 | 30 | 600 | 20 |
| [88] | COCO | Image | 2 | 115,000 | 10 | 100,000 | 10,000 |
| [103] | Building changes | Image | 6 | 203,358 | 222 | 200,000 | 900 |
| [89] | GHW | Structured | 2 | 2565 | 406 | 2159 | 5.3 |
| | ORP | Structured | 2 | 700 | 124 | 576 | 4.6 |
| [19] | MNIST | Image | 10 | 70,000 | 600 | 6000 | 10 |
| | CIFAR-100 | Image | 100 | 60,000 | 60 | 600 | 10 |
| | CALTECH-101 | Image | 102 | 9144 | 15 | 30 | 2 |
| | MIT-67 | Image | 67 | 6700 | 10 | 100 | 10 |
| | DIL | Image | 10 | 1300 | 24 | 331 | 13 |
| | MLC | Image | 9 | 400,000 | 2600 | 196,900 | 76 |
| [90] | KEEL | Structured | 2 | 3339 | 26 | 3313 | 128 |
| [91] | CIFAR-10 | Image | 10 | 60,000 | 250 | 5000 | 20 |
| | CIFAR-100 | Image | 100 | 60,000 | 25 | 500 | 20 |
| [22] | CelebA | Image | 2 | 160,000 | 3200 | 156,800 | 49 |
| [117] | MNIST | Image | 10 | 60,000 | 50 | 5000 | 100 |
| | MNIST-back-rot | Image | 10 | 62,000 | 12 | 1200 | 100 |
| | CIFAR-10 | Image | 10 | 60,000 | 5000 | 5000 | 1 |
| | SVHN | Image | 10 | 99,000 | 73 | 7300 | 100 |
| | STL-10 | Image | 10 | 13,000 | 500 | 500 | 1 |
| [118] | CelebA | Image | 2 | 160,000 | 3200 | 156,800 | 49 |
| [92] | EmotioNet | Image | 2 | 450,000 | 45 | 449,955 | 10,000 |
| [23] | MNIST | Image | 10 | 60,000 | 1 | 5000 | 5000 |
| | CIFAR-10 | Image | 10 | 60,000 | 100 | 5000 | 50 |
| | ImageNet | Image | 1000 | 1,050,000 | 10 | 1000 | 100 |
  1. Images from CelebA and EmotioNet are treated as sets of binary classification problems, because they are annotated with 40 and 11 binary attributes, respectively. The class imbalance in COCO arises from the extreme imbalance between background and foreground concepts.
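
Eq. 1 is not reproduced in this excerpt. The sketch below assumes that \(\rho\) is the ratio of the largest class size to the smallest class size, which agrees with most rows of the table. The function name `imbalance_ratio` and the `rows` dictionary are illustrative only; the sizes and reported values are copied from three rows of the table.

```python
def imbalance_ratio(class_sizes):
    """Return the largest class size divided by the smallest class size.

    Assumption: this matches the definition of rho in Eq. 1 of the survey.
    """
    sizes = list(class_sizes)
    if not sizes or min(sizes) <= 0:
        raise ValueError("class sizes must be non-empty and positive")
    return max(sizes) / min(sizes)


# (min class size, max class size, reported rho), copied from the table.
rows = {
    "Public cameras [21]": (14, 6986, 499),
    "GHW [89]": (406, 2159, 5.3),
    "CelebA [22]": (3200, 156800, 49),
}

for name, (smallest, largest, reported) in rows.items():
    computed = imbalance_ratio([smallest, largest])
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

The reported values in the table are rounded, so small differences from the computed ratio are to be expected.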