Table 1 Characteristics of the datasets (d: dimensions, N: number of objects, K: number of clusters or classes)

From: An ensemble method for estimating the number of clusters in a big data set using multiple random samples

| Dataset | d | Cluster/class sizes | N | K |
| --- | --- | --- | --- | --- |
| DS1 | 2 | 3 clusters; each has 50,000 | 1,000,000 | 10 |
| | | 4 clusters; each has 100,000 | | |
| | | 3 clusters; each has 150,000 | | |
| DS2 | 2 | 7 clusters; each has 25,000 | 1,000,000 | 20 |
| | | 1 cluster; has 30,000 | | |
| | | 1 cluster; has 45,000 | | |
| | | 6 clusters; each has 50,000 | | |
| | | 3 clusters; each has 75,000 | | |
| | | 1 cluster; has 100,000 | | |
| | | 1 cluster; has 125,000 | | |
| DS3 | 10 | 3 clusters; each has 15,000 | 1,000,000 | 30 |
| | | 3 clusters; each has 20,000 | | |
| | | 11 clusters; each has 25,000 | | |
| | | 1 cluster; has 30,000 | | |
| | | 2 clusters; each has 35,000 | | |
| | | 4 clusters; each has 40,000 | | |
| | | 3 clusters; each has 50,000 | | |
| | | 1 cluster; has 60,000 | | |
| | | 2 clusters; each has 75,000 | | |
| DS4 | 10 | 5 clusters; each has 5000 | 1,000,000 | 40 |
| | | 2 clusters; each has 10,000 | | |
| | | 4 clusters; each has 15,000 | | |
| | | 4 clusters; each has 20,000 | | |
| | | 16 clusters; each has 25,000 | | |
| | | 1 cluster; has 35,000 | | |
| | | 1 cluster; has 40,000 | | |
| | | 5 clusters; each has 50,000 | | |
| | | 1 cluster; has 60,000 | | |
| | | 1 cluster; has 75,000 | | |
| DS5 | 10 | 1 cluster; has 10,000 | 1,000,000 | 50 |
| | | 19 clusters; each has 15,000 | | |
| | | 14 clusters; each has 20,000 | | |
| | | 13 clusters; each has 25,000 | | |
| | | 1 cluster; has 30,000 | | |
| | | 2 clusters; each has 35,000 | | |
| Covertype | 54 | 211,840 : 283,301 : 35,754 : 27,747 : 8483 : 17,367 : 20,510 | 581,012 | 7 |
| KDD’99ID | 41 | 972,781 : 2,807,886 : 1,072,017 : 53 : 979 : 264 : 21 : 2203 : 8 : 12 : 2 : 1020 : 20 : 7 : 4 : 30 : 9 : 10 : 3 : 12,481 : 15,892 : 2316 : 10,413 | 4,940,000 | 23 |
| PokerHand | 10 | 513,702 : 433,097 : 48,828 : 3978 : 21,634 : 2050 : 1460 : 236 : 17 : 8 | 1,025,010 | 10 |
| SUSY | 18 | 2,712,173 : 2,287,827 | 5,000,000 | 2 |


  1. For the four real-world datasets, the values in the Cluster/class sizes column are the numbers of objects in each of the K classes
  2. The small classes are underlined
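
For the synthetic datasets, K and N follow from the cluster-size groups by simple arithmetic: K is the total number of clusters and N is the sum of clusters times objects per cluster. The minimal sketch below checks this for DS1 and DS2, using the figures from Table 1. The names `groups` and `summarize` are illustrative only and do not come from the source article.

```python
# Cluster-size groups for DS1 and DS2, copied from Table 1 as
# (number of clusters, objects per cluster). Names here are hypothetical.
groups = {
    "DS1": [(3, 50_000), (4, 100_000), (3, 150_000)],
    "DS2": [(7, 25_000), (1, 30_000), (1, 45_000), (6, 50_000),
            (3, 75_000), (1, 100_000), (1, 125_000)],
}


def summarize(size_groups):
    """Return (K, N) implied by a list of (count, size) groups."""
    k = sum(count for count, _ in size_groups)          # total number of clusters
    n = sum(count * size for count, size in size_groups)  # total number of objects
    return k, n


for name, size_groups in groups.items():
    k, n = summarize(size_groups)
    print(f"{name}: K = {k}, N = {n:,}")
# DS1: K = 10, N = 1,000,000
# DS2: K = 20, N = 1,000,000
```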