Table 7 Comparison of the estimated number of clusters produced by different methods on the real-world datasets

From: An ensemble method for estimating the number of clusters in a big data set using multiple random samples

| Dataset | K | es | nselectboot | kluster | X-means | Elbow | Silhouette | Gap statistic | RSPCE |
|---|---|---|---|---|---|---|---|---|---|
| Covertype | 7 | 5% | 2.8 ± 1.2 | 11.3 ± 0.5 | 2.1 ± 0.3 | 7.4 ± 1.2 | 2.1 ± 0.3 | 2.2 ± 0.3 | 4.6 ± 0.9 |
| | | 10% | 4.0 ± 1.9 | 10.3 ± 1.0 | 2.3 ± 0.7 | 7.3 ± 1.3 | 2.3 ± 1.5 | 2.3 ± 0.3 | 6.0 ± 0.7 |
| | | 20% | 6.6 ± 3.5 | 10.7 ± 0.5 | 2.2 ± 0.4 | 7.7 ± 0.8 | 2.2 ± 0.2 | 2.2 ± 0.2 | 7.1 ± 0.5 |
| | | 30% | 8.4 ± 5.5 | 10.7 ± 1.0 | 2.2 ± 0.4 | 8.0 ± 0.7 | 2.4 ± 0.4 | 2.2 ± 0.1 | 7.8 ± 0.8 |
| | | 40% | 9.4 ± 5.5 | 11.3 ± 1.0 | 2.2 ± 0.4 | 8.4 ± 1.4 | NA | 2.1 ± 0.1 | 8.2 ± 0.8 |
| | | 50% | 9.8 ± 5.9 | 11.0 ± 0.6 | 2.1 ± 0.3 | 7.3 ± 0.9 | NA | 2.1 ± 0.1 | 8.2 ± 0.6 |
| KDD’99ID | 23 | 1% | 6.8 ± 6.7 | 10.6 ± 0.9 | 53.6 ± 5.9 | 6.8 ± 0.9 | 4.3 ± 0.5 | 48.7 ± 0.6 | 2.9 ± 0.4 |
| | | 2% | 5.8 ± 6.3 | 10.4 ± 0.5 | 51.2 ± 5.2 | 4.8 ± 0.7 | 3.6 ± 0.4 | 46.5 ± 1.4 | 3.0 ± 0.0 |
| | | 5% | 6.8 ± 7.5 | 10.4 ± 0.5 | 53.0 ± 4.5 | 4.8 ± 0.6 | 3.6 ± 0.5 | 48.7 ± 0.6 | 3.0 ± 0.0 |
| | | 10% | 7.3 ± 8.5 | 10.2 ± 0.4 | 53.0 ± 4.5 | 5.2 ± 0.8 | 3.4 ± 0.5 | 46.5 ± 1.7 | 3.0 ± 0.0 |
| | | 15% | 8.0 ± 8.7 | 10.5 ± 0.6 | 53.5 ± 4.8 | 4.2 ± 0.8 | NA | NA | 3.0 ± 0.0 |
| | | 20% | 7.7 ± 8.1 | 10.4 ± 0.5 | 53.0 ± 4.5 | 4.4 ± 0.5 | NA | NA | 3.0 ± 0.0 |
| PokerHand | 10 | 1% | 3.2 ± 0.6 | 2.9 ± 0.3 | 17.8 ± 2.4 | 6.7 ± 1.7 | 4.5 ± 0.9 | 11.4 ± 1.3 | 3.5 ± 0.5 |
| | | 2% | 3.4 ± 0.7 | 3.6 ± 0.4 | 18.3 ± 2.5 | 7.2 ± 1.9 | 4.7 ± 0.5 | 12.2 ± 0.9 | 3.7 ± 0.6 |
| | | 5% | 2.7 ± 0.5 | 3.5 ± 0.6 | 17.9 ± 1.9 | 7.3 ± 1.2 | 3.9 ± 0.8 | 11.4 ± 2.3 | 3.9 ± 0.8 |
| | | 10% | 2.3 ± 0.6 | 3.6 ± 0.5 | 19.4 ± 1.5 | 6.9 ± 1.8 | 4.8 ± 0.8 | 11.7 ± 1.2 | 3.9 ± 0.4 |
| | | 15% | 2.4 ± 0.6 | 3.6 ± 0.4 | 18.1 ± 1.7 | 6.6 ± 1.3 | NA | NA | 4.0 ± 0.0 |
| | | 20% | 2.1 ± 0.8 | 3.8 ± 0.3 | 17.4 ± 1.4 | 6.8 ± 1.5 | NA | NA | 4.0 ± 0.0 |
| SUSY | 2 | 1% | 2.9 ± 0.4 | 8.3 ± 0.7 | 13.4 ± 1.3 | 4.3 ± 0.8 | 7.7 ± 0.8 | 4.5 ± 0.9 | 2.6 ± 0.6 |
| | | 2% | 3.2 ± 0.6 | 8.4 ± 0.5 | 11.9 ± 2.5 | 4.6 ± 1.2 | 6.5 ± 2.1 | 5.0 ± 1.6 | 3.2 ± 0.4 |
| | | 5% | 2.6 ± 0.7 | 9.2 ± 1.1 | 13.3 ± 2.1 | 5.1 ± 0.7 | 8.1 ± 1.5 | 5.2 ± 0.8 | 3.4 ± 0.5 |
| | | 10% | 2.5 ± 0.5 | 8.6 ± 0.5 | 15.1 ± 0.9 | 6.7 ± 0.9 | 9.3 ± 1.2 | 6.1 ± 1.2 | 3.3 ± 0.5 |
| | | 15% | 2.9 ± 1.1 | 9.2 ± 0.8 | 16.9 ± 1.2 | 7.7 ± 1.3 | NA | NA | 3.7 ± 0.4 |
| | | 20% | 2.8 ± 0.7 | 9.4 ± 0.6 | 15.4 ± 1.8 | 6.5 ± 0.8 | NA | NA | 3.8 ± 0.4 |

  1. Each entry is the mean ± standard deviation over 20 runs
  2. Sample size for RSPCE is \(n=5000\); K is the true number of classes in the dataset; and es is the ensemble size. The percentage was selected randomly
  3. NA indicates not available, i.e., the value cannot be computed