Table 2 Details of benchmark datasets used for evaluation and comparison with the state of the art

From: Unsupervised outlier detection in multidimensional data

| Dataset | Type | Description | # observations | # dimensions | # outliers |
|---|---|---|---|---|---|
| T48K* | Synthetic | Six multi-shape clusters with two types of noise | 8000 | 2 | 764 |
| Complex9* | Synthetic | Nine multi-shape clusters with noise | 10,000 | 2 | 792 |
| Cluto* | Synthetic | Eight multi-shape multi-density clusters with noise | 8000 | 2 | 323 |
| Arrhythmia** | Real | Patient records: normal vs cardiac arrhythmia | 450 | 259 | 206 |
| Heartdisease** | Real | Medical data on heart problems: healthy vs sick | 270 | 13 | 120 |
| Hepatitis** | Real | Medical data on hepatitis: patient will die (outliers) or survive (inliers) | 80 | 19 | 13 |
| Parkinson** | Real | Medical data: healthy people vs Parkinson's disease | 195 | 22 | 147 |
| Spambase40** | Real | Emails classified as spam (outliers) or non-spam | 4207 | 57 | 1679 |
| Glass** | Real | A forensic dataset describing types of glass | 214 | 7 | 9 |
| Pendigits** | Real | Different handwritten digits from 0 to 9 | 9868 | 16 | 20 |
| Shuttle** | Real | Space Shuttle data | 1013 | 9 | 13 |
| WBC** | Real | Cancer types, benign or malignant | 454 | 9 | 10 |
| WPBC** | Real | Wisconsin Prognostic Breast Cancer dataset | 198 | 33 | 47 |
| Pima** | Real | Medical data on diabetes | 768 | 8 | 268 |

  1. *Available at: https://github.com/deric/clustering-benchmark
  2. **Available at: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/
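
The share of outliers differs widely across these datasets, which matters when reading detection scores side by side. The following is a minimal sketch, not part of the original article, that computes the outlier fraction for each dataset from the observation and outlier counts in Table 2. The names `DATASETS`, `outlier_fraction`, and `summarize` are hypothetical helpers introduced here for illustration only.

```python
# Illustrative sketch: (name, # observations, # outliers) copied from Table 2.
# The helper names below are assumptions made for this example.
DATASETS = [
    ("T48K", 8000, 764),
    ("Complex9", 10000, 792),
    ("Cluto", 8000, 323),
    ("Arrhythmia", 450, 206),
    ("Heartdisease", 270, 120),
    ("Hepatitis", 80, 13),
    ("Parkinson", 195, 147),
    ("Spambase40", 4207, 1679),
    ("Glass", 214, 9),
    ("Pendigits", 9868, 20),
    ("Shuttle", 1013, 13),
    ("WBC", 454, 10),
    ("WPBC", 198, 47),
    ("Pima", 768, 268),
]


def outlier_fraction(n_observations, n_outliers):
    """Return the fraction of observations labelled as outliers."""
    return n_outliers / n_observations


def summarize(datasets=DATASETS):
    """Map each dataset name to its outlier fraction."""
    return {name: outlier_fraction(n, k) for name, n, k in datasets}


if __name__ == "__main__":
    # Print datasets from the lowest to the highest outlier fraction.
    for name, frac in sorted(summarize().items(), key=lambda kv: kv[1]):
        print(f"{name:<13} {frac:6.1%}")
```

Computed this way, the fractions range from roughly 0.2% (Pendigits) to roughly 75% (Parkinson), so the class labelled "outlier" is the majority in at least one dataset.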