A set theory based similarity measure for text clustering and classification

Journal of Big Data

Table 12 Performance evaluation of all measures when NF = the full feature set (Reuters = 18308, Web-KB = 33025 features): the averaged results (K = 1–120; +2)

No	Dataset	Reuters						Web-KB
No	Similarity/criterion	ACC	PRE	REC	FM	GM	AMP	ACC	PRE	REC	FM	GM	AMP
1	Euclidean	0.615	0.716	0.294	0.344	0.510	0.271	0.550	0.768	0.428	0.422	0.591	0.384
2	Cosine	0.897	0.903	0.716	0.767	0.836	0.650	0.766	0.803	0.670	0.715	0.799	0.614
3	Jaccard	0.865	0.813	0.550	0.601	0.730	0.488	0.786	0.859	0.684	0.694	0.792	0.619
4	Bhattacharyya	0.888	0.867	0.683	0.693	0.818	0.590	0.534	0.689	0.526	0.434	0.667	0.406
5	Kullback–Leibler	0.503	0.164	0.128	0.089	0.335	0.128	0.134	0.079	0.248	0.090	0.431	0.250
6	Manhattan	0.527	0.376	0.162	0.144	0.373	0.158	0.429	0.669	0.294	0.220	0.474	0.284
7	PDSM	0.909	0.899	0.700	0.745	0.827	0.631	0.801	0.854	0.714	0.727	0.812	0.642
8	STB-SM	0.912	0.913	0.739	0.783	0.851	0.676	0.791	0.841	0.706	0.715	0.806	0.630

Italic values indicate the highest values that the top measures achieved for the corresponding evaluation metrics.
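
The per-metric maxima that the note refers to can be recomputed directly from the rows of Table 12. The following is a minimal sketch, assuming that "highest" means the largest value in each metric column per dataset; the names `TABLE_12`, `METRICS`, and `best_per_metric` are illustrative and not taken from the paper, and the values are transcribed from the table above.

```python
# Metric columns in the order they appear in Table 12.
METRICS = ["ACC", "PRE", "REC", "FM", "GM", "AMP"]

# measure -> (Reuters values, Web-KB values), transcribed from Table 12.
TABLE_12 = {
    "Euclidean": ([0.615, 0.716, 0.294, 0.344, 0.510, 0.271],
                  [0.550, 0.768, 0.428, 0.422, 0.591, 0.384]),
    "Cosine": ([0.897, 0.903, 0.716, 0.767, 0.836, 0.650],
               [0.766, 0.803, 0.670, 0.715, 0.799, 0.614]),
    "Jaccard": ([0.865, 0.813, 0.550, 0.601, 0.730, 0.488],
                [0.786, 0.859, 0.684, 0.694, 0.792, 0.619]),
    "Bhattacharyya": ([0.888, 0.867, 0.683, 0.693, 0.818, 0.590],
                      [0.534, 0.689, 0.526, 0.434, 0.667, 0.406]),
    "Kullback-Leibler": ([0.503, 0.164, 0.128, 0.089, 0.335, 0.128],
                         [0.134, 0.079, 0.248, 0.090, 0.431, 0.250]),
    "Manhattan": ([0.527, 0.376, 0.162, 0.144, 0.373, 0.158],
                  [0.429, 0.669, 0.294, 0.220, 0.474, 0.284]),
    "PDSM": ([0.909, 0.899, 0.700, 0.745, 0.827, 0.631],
             [0.801, 0.854, 0.714, 0.727, 0.812, 0.642]),
    "STB-SM": ([0.912, 0.913, 0.739, 0.783, 0.851, 0.676],
               [0.791, 0.841, 0.706, 0.715, 0.806, 0.630]),
}


def best_per_metric(table, dataset_index):
    """Return {metric: (measure, value)} holding the highest value per metric.

    dataset_index is 0 for the Reuters block and 1 for the Web-KB block.
    """
    best = {}
    for col, metric in enumerate(METRICS):
        # Pick the measure whose value in this metric column is largest.
        name = max(table, key=lambda m: table[m][dataset_index][col])
        best[metric] = (name, table[name][dataset_index][col])
    return best


reuters_best = best_per_metric(TABLE_12, 0)
webkb_best = best_per_metric(TABLE_12, 1)

for metric in METRICS:
    print(metric, reuters_best[metric], webkb_best[metric])
```

On these values, STB-SM has the highest score for every metric on Reuters, while on Web-KB PDSM is highest for ACC, REC, FM, GM, and AMP and Jaccard is highest for PRE.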