A set theory based similarity measure for text clustering and classification

Amer, Ali A.; Abdalla, Hassan I.

doi:10.1186/s40537-020-00344-3

Research
Open access
Published: 14 September 2020

Journal of Big Data volume 7, Article number: 74 (2020)

Abstract

Similarity measures have long been used in information retrieval and machine learning for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure has been recorded to be highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive scrutiny of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag of words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation has been made on two of the most popular datasets, namely, Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, outperforms all state-of-the-art measures significantly with regard to both effectiveness and efficiency.

Introduction

In information retrieval and machine learning, a good number of techniques utilize similarity/distance measures to perform many different tasks [1]. Clustering and classification are the most widely used techniques for the task of knowledge discovery within the scientific fields [2,3,4,5,6,7,8,9,10]. Text classification and clustering, in particular, have long been vital research areas of information retrieval (IR). Text classification is the process of assigning a text/document to its actual class by utilizing a similarity measure and a proper classifier, whereas clustering is the process of grouping similar texts into groups called clusters. With the ever-growing amount of data and information on the internet, the necessity for a highly effective classification algorithm is urgent, and the enhancement of classification performance remains the main task for researchers in the text mining field. Given that similarity/distance measures are the core component of classification and clustering algorithms, their efficiency and effectiveness directly impact the performance of these techniques. Therefore, the selection of the best similarity measure for the techniques in question is still an open-ended, challenging task.

Even though several works have been proposed in the IR literature to compare similarity/distance measures for clustering and classification purposes [2, 3, 11,12,13,14,15,16], these studies are still incapable of providing a comprehensive view of the actual performance of similarity measures. Besides, some of those works have presented an efficient similarity measure while ignoring effectiveness [17, 21], while others have presented an effective similarity measure while ignoring efficiency [2,3,4]. Consequently, this work addresses this critical limitation by introducing a compromise (effective and time-efficient) similarity measure, while the most widely used similarity measures are investigated thoroughly under numerous circumstances. Using the K-nearest neighbor classifier (KNN), the K-means clustering algorithm, and the bag of words (BoW) representation model [17,18,19] for feature selection, the similarity measures are examined in detail. The K value (in KNN) is varied from (1) to (120) and the number of features is set to be in (50, 100, 300, 1000, 3000, 6000, and the whole number of features of the considered dataset). In doing so, the superiority of the STB-SM measure is emphasized, and each measure is tested under several circumstances so that the desired effectiveness, including accuracy, is obtained for certain K values on several numbers of features. These measures are evaluated against low dimensional datasets (by studying their performance on 50, 100, 200, and 350 features) and high dimensional datasets (by studying their performance on 3000, 6000, and the number of all features of the dataset). The measures’ behavior has been analyzed to determine which measure gives the best results for certain K values on a specific number of features. Furthermore, for the clustering performance analysis, five evaluation metrics were employed, two of which are internal and three external.
The key objective of this work is to present a new competitive measure and to compare and benchmark the performance of the similarity measures on the targeted datasets in both the low and the high-dimensional settings. Briefly, the main contributions of this work are listed below:

1. Introducing a novel similarity measure for text retrieval that is based on the set theory mechanism. This measure has been named a set theory based similarity measure for text retrieval (STB-SM). In accordance with the experimental results of both classification and clustering, STB-SM has been shown to be a promising measure that is superior to the existing state-of-the-art measures.
2. Along with proposing the STB-SM, seven similarity measures that are commonly applied for text retrieval and machine learning purposes are thoroughly investigated and evaluated to benchmark their impact on text retrieval. They are comprehensively tested on two of the most popular publicly available datasets (namely, Web-KB and Reuters-21). Using BoW, a thorough comparative analysis of these measures, in terms of their effectiveness and efficiency, is drawn. The classification effectiveness includes six evaluation factors, namely, accuracy, precision (PRE), recall (REC), F-Measure (FM), G-Measure (GM) and Average Mean Precision (AMP), while the clustering effectiveness includes five evaluation metrics, namely, Purity, Completeness and Rand Index as the external metrics, along with the Calinski-Harabasz index and the Davies-Bouldin index as the internal metrics. Moreover, for both classification and clustering efficiency, the run time taken by each measure to find the similarity degree is rigorously observed.
3. The scope of this work concentrates on promoting the performance of text clustering and classification through a new measure, along with a detailed comparative analysis of the proposed measure against the state-of-the-art BoW-based similarity measures. The drawn analyses would provide an influential guide for the selection of similarity measures in terms of the considered datasets, as well as helping researchers to fully understand the present and future challenges linked with text retrieval.

The rest of this paper is structured as follows: the most relevant similarity measures for this study are concisely presented in Sect. “Related work”. Section “The set theory” briefly describes the basics and definitions of set theory in the context of text retrieval. Section “The proposed similarity measure (STB-SM)” defines, formulates, and analyzes the proposed similarity measure in the context of the set theory. The experimental setup is drawn in Sect. “Experimental setup”. The results of the work are given in Sect. “Experimental results”. The discussion is profoundly detailed in Sect. “Discussion”. Finally, conclusions and future work recommendations are presented in Sect. “Conclusions and future work”.

Related work

The Vector Space Model (VSM) has long been used to represent documents when dealing with text retrieval. In VSM, each document is drawn as an N-dimensional vector, and each dimension represents a vocabulary term/feature. In the information retrieval (IR) literature, there are a good number of similarity measures to compute the pairwise document similarity using VSM. While some works have been proposed in the IR literature to perform clustering along with classification using similarity/distance measures [2,3,4, 11,12,13,14,15,16], these works lack a comprehensive view of the actual performance of similarity measures. Moreover, some of them have proposed efficient similarity measures irrespective of their effectiveness [21, 22]. Other works, however, have presented only effective similarity measures without consideration of their efficiency [2,3,4].

Euclidean and Manhattan distances are among the most famous geometric measures which have been utilized to find the distance between each vector pair [2, 20]. Similarly, Cosine similarity finds the similarity between each document pair using the angle between their vectors [10]. The triangle distance is also looked at as the Cosine of a triangle between a vector pair [10]. The value of this measure ranges between 0 and 2. On the other hand, for 0–1 vectors, the Hamming distance [4] is used to give the number of positions at which the feature weights are not equal. The Kullback–Leibler divergence (KLD) [23, 24], a non-symmetric measure, was used in [24] to compute the similarity between each vector pair using the probability distributions associated with both vectors. In [4], a similarity measure for text processing, named SMTP, was proposed to calculate the similarity between a document pair. An Information-Theoretic measure (IT-Sim) was proposed based on information theory in [18] for document similarity purposes. In [3], a new similarity measure called Improved Sqrt-Cosine (ISC) was proposed. Meanwhile, the Bhattacharyya coefficient was invented in [21] to approximately calculate the overlap rate between each statistical sample pair. The Jaccard coefficient was developed in [25] to find similarity using the ratio of the number of features existing in both documents to the number of features existing in at least one of them. Subsequently, in [2], a new similarity measure named pairwise document similarity measure based on present term set (PDSM) was presented based on the feature weights as well as the number of features that exist in at least one of the considered documents.

Some of these measures have been shown to be highly effective, such as the PDSM [2], the ISC [3], and the SMTP [4], yet unfortunately time-inefficient. In contrast, some measures are not effective yet highly efficient, notably the Euclidean and Manhattan distances. Cosine, on the other hand, has been seen as a compromise solution, being an effective and highly efficient measure. Furthermore, as reported in the IR literature, almost all of these measures were tested in the context of text classification and clustering. For example, PDSM was compared in [2] with five similarity measures in terms of classification and near-duplicate detection. Likewise, ISC [3] and SMTP [4] were evaluated against several similarity measures concerning text classification and clustering. Similarly, the measure proposed in this work has been evaluated against some of the most widely used similarity measures in the machine learning and information retrieval literature, particularly with respect to text classification and clustering. Finally, [7] assessed the clustering performance of several measures on three collections of web documents. The results of their experiments revealed that Cosine similarity outperforms both the Jaccard coefficient and the Euclidean distance.

The most relevant similarity measures

In this sub-section, the similarity measures that are considered in this study are presented. Seven similarity measures are introduced as the most widely used measures for text clustering and classification [2, 20,21,22,23,24]. These similarity measures work by considering the terms’ presence and absence, by evaluating the angle between each vector pair, or by finding the distance between the vectors. Assuming that we have two documents doc1 and doc2 with vectors d1 and d2, the aim is to find how similar they are under each of the following similarity measures:

Euclidean distance (ED)

Every document is drawn as a point in N-dimensional space, where each of the N dimensions corresponds to the frequency of one term. ED finds the distance between each point pair in this N-dimensional space using their coordinates, based on the following equation:

$$D_{Euc} \left( {doc1,doc2} \right) = \sqrt {(doc_{11} - doc_{12} )^{2} + (doc_{21} - doc_{22} )^{2} + \ldots + (doc_{n1} - doc_{n2} )^{2} }$$

(1)

Manhattan

Manhattan distance (known as sum-norm) finds the sum of absolute differences between the corresponding coordinates of each pair of document vectors as follows:

$${\text{Manhattan-distance}} \left( {doc1, doc2} \right) = \mathop \sum \limits_{i = 1}^{n} \left| {doc_{i1} - doc_{i2} } \right|$$

(2)
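Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names and the toy vectors are assumptions introduced here, and the vectors are taken to be plain term-frequency lists of equal length.

```python
import math
from typing import Sequence


def euclidean_distance(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Eq. (1): square root of the summed squared coordinate differences."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


def manhattan_distance(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Eq. (2): sum of absolute coordinate differences."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of features")
    return float(sum(abs(a - b) for a, b in zip(d1, d2)))


# Hypothetical term-frequency vectors over three terms.
doc1 = [1, 1, 3]
doc2 = [1, 0, 2]
print(euclidean_distance(doc1, doc2))  # sqrt(0 + 1 + 1)
print(manhattan_distance(doc1, doc2))  # 0 + 1 + 1
```

Both functions return distances, so smaller values mean more similar documents.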

Cosine similarity measure

The Cosine similarity calculates the pairwise similarity between document pairs using the dot product and the magnitudes of the two document vectors. It is widely utilized within the scientific fields, including the IR field [20], and is defined as follows:

$$Sim_{Cos} \left( {doc1,doc2} \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{n} (doc_{i1} * doc_{i2} )}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} doc_{i1}^{2} } *\sqrt {\mathop \sum \nolimits_{i = 1}^{n} doc_{i2}^{2} } }}$$

(3)

The product of the two vector magnitudes is used to normalize the inner product.
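As a small sketch of Eq. (3), the following Python function divides the dot product by the product of the vector magnitudes. The function name is an assumption, and returning 0 for an all-zero vector is a convention chosen here, since the equation is undefined in that case.

```python
import math
from typing import Sequence


def cosine_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Eq. (3): dot product normalized by the two vector magnitudes."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)


# 7 / (sqrt(11) * sqrt(5)) is roughly 0.94
print(round(cosine_similarity([1, 1, 3], [1, 0, 2]), 2))
```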

Jaccard similarity measure

This coefficient was introduced in [25]. It divides the intersection of the two documents by their union, and its value ranges between 0 (there is no similarity between the documents) and 1 (both documents are identical). The Jaccard similarity is given by the next equation:

$$Sim_{jaccard} \left( {doc1,doc2} \right) = \frac{doc1 \cap doc2}{doc1 \cup doc2}$$

(4)
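Equation (4) does not state whether term weights enter the computation, so the sketch below takes one possible reading: each document is reduced to the set of terms with a non-zero weight, and the sizes of the intersection and the union are compared. The function name and the handling of two empty documents are assumptions, and this reading is not guaranteed to reproduce the Jaccard values reported later in the paper.

```python
from typing import Sequence


def jaccard_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Set-based reading of Eq. (4) over the terms present in each document."""
    present1 = {i for i, w in enumerate(d1) if w != 0}
    present2 = {i for i, w in enumerate(d2) if w != 0}
    union = present1 | present2
    if not union:
        return 0.0  # assumption: two empty documents get similarity 0
    return len(present1 & present2) / len(union)


# Three shared terms out of six terms present in at least one document.
print(jaccard_similarity([0, 2, 1, 1, 0, 1], [3, 1, 1, 1, 1, 0]))
```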

Bhattacharyya coefficient

The Bhattacharyya coefficient is used to approximately calculate the overlap rate between each statistical sample pair [21]. In our work, however, these samples are thought of as documents. This coefficient is utilized to find the approximate closeness of each document pair.

$$Sim_{Bhatta} \left( {doc1,doc2} \right) = 1 - \log \left( {\sum\nolimits_{i = 1}^{n} {\sqrt {doc_{i1} * doc_{i2} } } } \right)$$

(5)
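The inner sum of Eq. (5) is the Bhattacharyya coefficient itself. The sketch below computes only that coefficient, under the assumption (not stated in the text) that each term-frequency vector is first normalized into a probability distribution; the logarithmic transform of Eq. (5) would be applied on top of the returned value. The helper names are hypothetical.

```python
import math
from typing import List, Sequence


def to_distribution(tf: Sequence[float]) -> List[float]:
    """Normalize a term-frequency vector so that its entries sum to 1."""
    total = sum(tf)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [w / total for w in tf]


def bhattacharyya_coefficient(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Sum over terms of sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint ones."""
    p, q = to_distribution(d1), to_distribution(d2)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


print(bhattacharyya_coefficient([1, 1, 3], [1, 1, 3]))  # close to 1: full overlap
print(bhattacharyya_coefficient([3, 0, 1], [0, 2, 0]))  # 0.0: no shared term
```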

Kullback–Leibler divergence

It is also known as “relative entropy” [23, 24]. It is used to measure the difference between two probability distributions. When this measure reaches 0, it signals that the two distributions are identical. Its equation is as follows:

$$Sim_{KL} \left( {doc1,doc2} \right) = \mathop \sum \limits_{i = 1}^{n} (doc_{i1} )*{ \log }\left( {\frac{{doc_{i1} }}{{doc_{i2} }}} \right)$$

(6)
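A minimal sketch of Eq. (6) follows. Term-frequency vectors usually contain zeros, for which the ratio inside the logarithm is undefined, so this sketch adds a small smoothing constant before normalizing; that smoothing, the `eps` parameter, and the function name are assumptions introduced for illustration and are not described in the paper.

```python
import math
from typing import List, Sequence


def kl_divergence(d1: Sequence[float], d2: Sequence[float], eps: float = 1e-9) -> float:
    """Eq. (6) on smoothed, normalized vectors; 0 means identical distributions."""

    def smooth(tf: Sequence[float]) -> List[float]:
        # Additive smoothing (assumption) keeps every probability strictly positive.
        total = sum(tf) + eps * len(tf)
        return [(w + eps) / total for w in tf]

    p, q = smooth(d1), smooth(d2)
    return sum(a * math.log(a / b) for a, b in zip(p, q))


a, b = [1, 1, 3], [1, 2, 2]
print(kl_divergence(a, a))                       # 0.0 for identical vectors
print(kl_divergence(a, b), kl_divergence(b, a))  # the two directions differ
```

The second line of output illustrates why the measure is described as non-symmetric: swapping the arguments changes the value.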

PDSM

This measure was introduced in [2] to tackle the limitations of the state-of-the-art measures by taking the number of present terms into account. PDSM was seen to be effective according to the experimental results of [2] as well as the experimental results of our current work. The PDSM equation is formulated as follows:

$$D_{pdsm} \left( {doc1,doc2} \right) = \frac{{doc_{i1} \cap doc_{i2} }}{{doc_{i1} \cup doc_{i2} }}*\frac{{PF\left( {doc_{i1} ,doc_{i2} } \right)}}{{M - AF\left( {doc_{i1} ,doc_{i2} } \right) + 1}}$$

(7)

where

$$doc_{i1} \cap doc_{i2} = { \hbox{min} }\left( {doc_{i1} , doc_{i2} } \right)$$

$$doc_{i1} \cup doc_{i2} = { \hbox{max} }\left( {doc_{i1} , doc_{i2} } \right)$$

where $PF(doc_{i1}, doc_{i2})$ represents the number of present terms, $AF(doc_{i1}, doc_{i2})$ represents the number of absent terms, and M is the total number of documents.

The set theory

Before introducing the proposed measure, this section presents the basic definitions and operations of set theory, in the context of text retrieval, upon which our measure is built.

Generally speaking, set theory is a vital component of modern mathematics and is widely used in all formal descriptions. A set can be a collection, a group, or even a cluster of points that are named members of that set. For instance, a set of documents is a collection of documents, and a set of people is a group of people. For each point to be a member of a set, its membership shall be defined clearly. However, sometimes, due to the lack of information, membership definition is a difficult task and may even be vague. If the membership definition is vague for some collection, the collection cannot be called a set. Simply put, if there is a set S with two members X and Y, then it must be known whether X = Y or not. Strictly speaking, a set can be finite, infinite, or empty. In the following, some basic definitions and key operations are introduced to further understand the basics upon which the STB-SM measure is built.

Definition 1

If we have two sets S1 and S2, both sets are equal if and only if they have the same points, that is, for every X, $X \in S1 \Leftrightarrow X \in S2.$ For example, in the context of text retrieval, suppose we have Doc1{Ali, Jun, Sarah} and Doc2{Jun, Sarah, Ali}. Then, we can say that Doc1 = Doc2, and they are identical, as every word that belongs to Doc1 also belongs to Doc2 and vice versa.

Definition 2

If we have two sets S1 and S2, S1 is a subset of S2 (S1 $\subseteq$ S2) if every X $\in$ S1 is also a member of S2 (X $\in$ S2); S1 is a proper subset of S2 if, in addition, S1 and S2 are not equal. For example, in the context of text retrieval, suppose we have Doc1{Ali, Hassan, Sarah} and Doc2{Hassan, Sarah, Ali, Mark, Farah}. Then, we can say that Doc1 $\subseteq$ Doc2, and Doc1 is a proper subset of Doc2, as every word that belongs to Doc1 also belongs to Doc2 while Doc2 contains further words.

Definition 3

The document doc is a collection of terms, or a vector that holds these terms; that is, any subset of C, where C is the document collection (including C itself).

Let doc be a document, a subset of C. We say that doc exists as a vector if the terms of doc exist in the doc itself. First, let us define the key relationships between each document pair doc1 and doc2 in the collection C as follows:

$$doc1 \subset doc2 \Leftrightarrow \left( {T \in doc1 \Rightarrow T \in doc2} \right) \quad \left( {\text{containment}} \right)$$

$$doc1 = doc2 \Leftrightarrow doc1 \subset doc2 {\text{ and }} doc2 \subset doc1 \quad \left( {\text{equality}} \right)$$

So, for the given document pair doc1 and doc2, the following operations hold:

Operation 1—union

The union of two sets S1 and S2 (S1 $\cup$ S2) is the set that contains all the elements of both sets S1 and S2, with duplicates removed.

$$S1 \cup S2 = \left\{ {X |X \in S1 {\text{ or }} X \in S2} \right\}$$

In the context of text retrieval, the Union operation of doc1 and doc2, doc1 $\cup$ doc2, is the group of terms {t₁,…, t_n} where n is the number of addressed terms in both documents, that are involved in either doc1, doc2 or both:

$$doc1 \cup doc2 = \left\{ {t : t \in doc1 {\text{or }}t \in doc2} \right\}.$$

Operation 2—intersection

The intersection of two sets S1 and S2 (S1 $\cap$ S2) is the set that contains the shared elements of sets S1 and S2.

$$S1 \cap S2 = \left\{ {X |X \in S1 {\text{ and }} X \in S2} \right\}$$

In the context of text retrieval, the Intersection operation of doc1 and doc2, doc1 $\cap$ doc2, is the group of terms {t₁,…, t_n} where n is the number of addressed terms in both documents, that are involved in both documents doc1 and doc2 at the same time:

$$doc1 \cap doc2 = \left\{ {t : t \in doc1 {\text{and }}t \in doc2} \right\}.$$

Operation 3—negation

The negation (set difference) operation of doc1 and doc2, written doc1\doc2 or doc2\doc1, is the group of terms that belong to one document but not to the other:

$$doc1 \backslash doc2 = \left\{ {t:t \in doc1 {\text{ and }}t \notin doc2} \right\}.$$

$$doc2 \backslash doc1 = \left\{ {t:t \in doc2 {\text{ and }}t \notin doc1} \right\}.$$
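The definitions and operations of this section map directly onto Python's built-in `set` type. The snippet below is only an illustration using the word sets from Definitions 1 and 2; the variable names are assumptions.

```python
# Documents as sets of terms, taken from the examples in Definitions 1 and 2.
doc1 = {"Ali", "Hassan", "Sarah"}
doc2 = {"Hassan", "Sarah", "Ali", "Mark", "Farah"}

union = doc1 | doc2          # Operation 1: terms in either document
intersection = doc1 & doc2   # Operation 2: terms in both documents
only_in_doc1 = doc1 - doc2   # Operation 3: doc1 \ doc2
only_in_doc2 = doc2 - doc1   # Operation 3: doc2 \ doc1

is_subset = doc1 <= doc2         # Definition 2: subset
is_proper_subset = doc1 < doc2   # subset and not equal
is_equal = {"Ali", "Jun", "Sarah"} == {"Jun", "Sarah", "Ali"}  # Definition 1

print(sorted(union))
print(sorted(intersection))
print(sorted(only_in_doc1), sorted(only_in_doc2))
print(is_subset, is_proper_subset, is_equal)
```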

The proposed similarity measure (STB-SM)

The formulation of STB-SM similarity measure

Suppose we have a document pair doc1 and doc2. Let doc1 = (w₁₁, w₁₂,…) and doc2 = (w₂₁, w₂₂,…) be the weighting vectors (using the BoW model) of the term sets for document 1 and document 2, respectively. Let T₁ = {t₁₁, t₁₂,…, t_1n} and T₂ = {t₂₁, t₂₂,…, t_2n} be the sets of terms that are contained in doc1 and doc2, respectively. The proposed STB-SM equations are as follows:

$$X = \left( {\mathop \sum \limits_{{t \in doc_{1} \cap doc_{2} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{1} \cap doc_{2} }} W_{2j} } \right)$$

(8)

$$Y = \left( {\mathop \sum \limits_{{t \in doc_{1} \backslash doc_{2} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{2} \backslash doc_{1} }} W_{2j} } \right)$$

(9)

$$Z = \left( {\mathop \sum \limits_{{t \in doc_{1} }} W_{1j} } \right)* \left( {\mathop \sum \limits_{{t \in doc_{2} }} W_{2j} } \right)$$

(10)

$${\text{STB}} - {\text{SM}}\left( {doc_{1} ,doc_{2} } \right) = \frac{X}{Z}*\left( {1 - \frac{Y}{Z}} \right)$$

(11)

where the notations “∩” and “\” denote the intersection and complement (set difference) operators in set theory, and W_ij is the weighting value. To further understand the mechanism of this measure and briefly clarify some deficiencies of the state-of-the-art measures, we provide three examples as follows:

Example 1

Assuming we have doc1 (2, 5, 7, 8, 0, 9) and doc2 (9, 0, 0, 6, 5, 1), then STB-SM will work as follows (for simplicity, X is the product of X1 and X2, Y is the product of Y1 and Y2, Z is the product of Z1 and Z2, and T_i.w denotes the weight of term i):

	T1.w	T2.w	T3.w	T4.w	T5.w	T6.w
Doc1	2	5	7	8	0	9
Doc2	9	0	0	6	5	1

X1 = 2 + 8 + 9 = 19; X2 = 9 + 6 + 1 = 16; Z1 = 2 + 5 + 7 + 8 + 9 = 31; Z2 = 9 + 6 + 5 + 1 = 21; Y1 = 5 + 7 = 12; Y2 = 5

While STB-SM yielded (0.47 * 0.91 = 0.43), Cosine and Jaccard yielded (0.42) and (0.22), respectively.

Example 2

Assuming we have doc1 (0, 2, 1, 1, 0, 1) and doc2 (3, 1, 1, 1, 1, 0), then STB-SM will work as follows:

	T1.w	T2.w	T3.w	T4.w	T5.w	T6.w
Doc1	0	2	1	1	0	1
Doc2	3	1	1	1	1	0

X1 = 4; X2 = 3; Z1 = 5; Z2 = 7; Y1 = 1; Y2 = 4

While STB-SM yielded (0.34 * 0.89 = 0.30), Cosine and Jaccard yielded (0.42) and (0.50) respectively.

Example 3

Assuming we have doc1 (1, 1, 3) and doc2 (1, 0, 2), then STB-SM will work as follows;

	T1.w	T2.w	T3.w
Doc1	1	1	3
Doc2	1	0	2

X1 = 4; X2 = 3; Z1 = 5; Z2 = 3; Y1 = 1; Y2 = 0

While STB-SM yielded (0.80), Cosine and Jaccard yielded (0.94) and (0.25) respectively.

As seen from the examples above, Cosine occasionally finds a good similarity, as indicated in example (1). However, the Cosine similarity gives the same value for both examples (1 & 2) despite the clear difference between the vector pairs, and, to make matters worse, the similarity value is highly exaggerated in example 3. It is worth indicating that one novelty of the STB-SM measure is that the similarity value is not exaggerated in the way shown in example (3) for Cosine or other state-of-the-art measures. The STB-SM measure enables non-zero/non-shared features to have an explicit contribution to the similarity computation. Therefore, STB-SM takes the presence and absence of all features into consideration effectively.

On the other hand, Jaccard occasionally produces a good similarity, as shown in example (2), but more frequently the Jaccard similarity is poor, as indicated in examples (1 & 3). Our proposed measure, therefore, offers a compromise solution in which the desired effect is obtained. Examples (1 & 3) show a better and more accurate similarity found by STB-SM in comparison with Cosine and Jaccard.
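Equations (8) to (11) and the three worked examples can be checked with a short Python sketch. It is a direct transcription of the formulas under two assumptions made here: a term counts as present in a document when its weight is non-zero, and the similarity of a pair involving an all-zero vector is defined as 0. The function name `stb_sm` is also an assumption.

```python
from typing import Sequence


def stb_sm(d1: Sequence[float], d2: Sequence[float]) -> float:
    """STB-SM following Eqs. (8)-(11) on two equal-length weight vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of features")
    pairs = list(zip(d1, d2))
    # Eq. (8): weights of shared terms, summed per document, then multiplied.
    x = sum(a for a, b in pairs if a and b) * sum(b for a, b in pairs if a and b)
    # Eq. (9): weights of terms present in only one of the two documents.
    y = sum(a for a, b in pairs if a and not b) * sum(b for a, b in pairs if b and not a)
    # Eq. (10): total weight of each document.
    z = sum(d1) * sum(d2)
    if z == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    # Eq. (11)
    return (x / z) * (1 - y / z)


print(round(stb_sm([2, 5, 7, 8, 0, 9], [9, 0, 0, 6, 5, 1]), 3))  # Example 1
print(round(stb_sm([0, 2, 1, 1, 0, 1], [3, 1, 1, 1, 1, 0]), 3))  # Example 2
print(round(stb_sm([1, 1, 3], [1, 0, 2]), 3))                    # Example 3
```

For Example 1 the unrounded product is about 0.424; the value 0.43 quoted in the text results from multiplying the already rounded factors 0.47 and 0.91. Because X, Y and Z are each symmetric in the two documents, swapping the arguments leaves the result unchanged.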

STB-SM analysis

In this subsection, we concisely analyze the cases of the proposed measure as follows:

The worst-case:

This case occurs when there is not even one shared feature between the document vectors.

Example (worst case): Assuming we have doc1 (3, 0, 1) and doc2 (0, 2, 0). By applying the worst-case scenario, we find that X1 = 0, X2 = 0; Z1 = 4, Z2 = 2; Y1 = 4, Y2 = 2. Because X = 0, STB-SM = 0. The same holds for the binary documents (1, 0, 1) and (0, 1, 0), which is logically true since no shared feature exists.

The average case:

This occurs when there is at least one shared feature, as in examples (1–3) above. In this case, STB-SM has a value in the range [0, 1].

The best case:

This occurs when both vectors are completely equivalent.

Example (best case): Assuming we have doc1 (4, 4, 4) and doc2 (4, 4, 4), or doc1 (1, 1, 1) and doc2 (1, 1, 1). By applying the best-case scenario, we find for the first pair that X1 = X2 = Z1 = Z2 = 12 and Y1 = Y2 = 0, and for the second pair that X1 = X2 = Z1 = Z2 = 3 and Y1 = Y2 = 0. In both cases X = Z and Y = 0. Accordingly, STB-SM = 1, which is logically true as both documents are identical.

The properties of similarity measures

According to [2, 4], there are six vital properties that every similarity measure should have in order to be considered an optimal measure. These properties are listed below:

Property 1:

The existence or non-existence of the intended feature is more vital than the difference between the values linked with the existing feature. According to the examples calculated above, STB-SM explicitly takes the presence and absence of features into consideration.

Property 2:

The value of similarity should grow as the difference between the values of non-zero features declines. For instance, suppose f1 and f2 are two features that belong to doc1 and doc2, respectively. Then, for doc1 and doc2, the similarity between f1 = 12 and f2 = 6 is higher than the similarity between f1 = 20 and f2 = 6. This property is also clearly shown in example 3, along with the worst-case example.

Property 3:

The value of similarity should be reduced as the number of existent or non-existent features rises. This was showcased in both the worst and best case examples, clearly indicating the applicability of this property.

Property 4:

Any pair of documents is less similar if many non-zero-valued features in one document correspond to zero-valued features in the other. For instance, suppose we have two vectors for two documents doc1(f1, f2) = (1, 0) and doc2(f3, f4) = (1, 1). Then, doc1.f2 and doc2.f4 are the key cause for lowering the similarity between both documents, as f2 × f4 = 0 and, at the same time, f2 + f4 > 0. Example 2 supports the applicability of this property.

Property 5:

The similarity measure should be symmetric. For instance, the similarity between doc1 (1, 1, 0) and doc2 (1, 1, 1) must be the same as the similarity between doc2 (1, 1, 1) and doc1 (1, 1, 0). According to the examples above, STB-SM enjoys this property completely.

Property 6:

The distribution of feature values should contribute to the similarity between every two documents. That means features with a higher spread (standard deviation) contribute more to the similarity than those with a lower spread.

Experimental setup

Text pre-processing

Some standard operations were carried out to transform the text into vectors for processing. The text was converted to lower case; numbers, punctuation, stop words (common words), and extra white space were all removed; and some particular symbols (such as $, %) were converted into spaces.

Text representation

The bag of words (BoW) model [26, 27] was used to represent documents that were in the vector space model (VSM). The BoW model represents each document as a word collection disregarding the grammar and word order [28].

Given that we used Python to run the text pre-processing, the pre-processing was performed using the NLTK (Natural Language Toolkit) library of Python as follows:

Tokenization: using the NLTK word tokenizer
Converting all the words to lower case: using the lower() Python string function
Lemmatizing: using the NLTK stem WordNetLemmatizer
Stopword removal: using the NLTK stopwords
Considering only words with 4 or more letters
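The steps listed above can be sketched as a single function. To keep the snippet dependency-free, it swaps the NLTK components for simple stand-ins: a regular expression instead of the NLTK word tokenizer, and a tiny hand-written stop-word set instead of the NLTK stop-word corpus; lemmatization is omitted entirely. These substitutions, the function name, and the sample sentence are assumptions, so the output will differ from a real NLTK pipeline.

```python
import re
from typing import List

# Tiny illustrative stop-word list (assumption); the paper uses the NLTK stopwords.
STOPWORDS = {"the", "and", "with", "that", "this", "from", "have", "were"}


def preprocess(text: str, min_len: int = 4) -> List[str]:
    """Lower-case, tokenize, drop stop words and words shorter than min_len."""
    # Keeping only alphabetic runs also removes numbers, punctuation,
    # symbols such as $ and %, and extra white space.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]


print(preprocess("The 3 documents were clustered with K-means, and 95% matched!"))
# ['documents', 'clustered', 'means', 'matched']
```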

The comparison mechanism of classification

After pre-processing, all of the documents were represented using the BoW model in VSM in order for the classification process to start smoothly. Following that, the performance of every similarity measure across the different kinds of documents was compared and evaluated. Six evaluation measures were used, namely, accuracy, precision, recall, F-measure, G-measure, and average mean precision. For each criterion, the KNN algorithm was run from K = 1 to K = 120 over each number of features of each dataset, and the averaged results were accumulated and are given in the Tables below (5, 6, 7, 8, 9, 10, 11, 12, 13). The number of features (NF) was varied over NF = 10, 50, 100, 200, 350, 3000, 6000 and the whole number of features (see Appendix samples). In consequence, we have eight runs of the KNN algorithm over two datasets to test and examine six criteria using eight similarity measures. The total number of runs performed to obtain the results below was 8 × 2 × 6 × 8 = 768. If we also consider the sixty (60) values of K that were tested in each KNN cycle, the total number of runs would be 46080.

Term weighting

We adopted the most widely used weighting technique, Term Frequency (TF), which simply gives the number of occurrences of each word in the respective document [29, 30].
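As an illustration of TF weighting over a BoW representation, the sketch below builds a shared vocabulary from already tokenized documents and counts term occurrences per document. The function name, the alphabetical ordering of the vocabulary, and the toy documents are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def tf_vectors(docs_tokens: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return the shared vocabulary and one term-frequency vector per document."""
    vocabulary = sorted({t for tokens in docs_tokens for t in tokens})
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors


# Two hypothetical pre-processed documents.
docs = [["cluster", "text", "text"], ["text", "measure"]]
vocab, vecs = tf_vectors(docs)
print(vocab)  # ['cluster', 'measure', 'text']
print(vecs)   # [[1, 0, 2], [0, 1, 1]]
```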

K-nearest neighbor classifier

The K-nearest neighbor algorithm (KNN) is the most widely used algorithm in the IR literature to perform document classification. Although it is a lazy algorithm [27], it is nonparametric, simple, and believed to be amongst the top ten algorithms in data mining [31]. It works by selecting the points nearest to the point in question. The concept of KNN is that points that belong to the same class are highly likely to be close to one another under the similarity measure used. KNN assumes the following: (1) points in the feature space have a specific distance between each other, and that distance is used as a metric to gauge closeness; (2) each point among the training points has its own vector and class label. A certain number “k” is then chosen to delimit the neighborhood of the point in question.
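A minimal sketch of KNN with a pluggable similarity measure is given below. It uses Cosine similarity as the example measure and majority voting among the k most similar training documents; the function names, the tie-breaking behavior (inherited from `Counter.most_common`), and the toy training data with their labels are assumptions, not the experimental setup of the paper.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0


def knn_predict(query: Sequence[float],
                train_vectors: List[Sequence[float]],
                train_labels: List[str],
                k: int,
                similarity: Callable[[Sequence[float], Sequence[float]], float]) -> str:
    """Label the query with the majority class of its k most similar neighbors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: similarity(query, pair[0]),
                    reverse=True)  # most similar first
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical training documents over three terms.
train_vectors = [[3, 0, 1], [4, 0, 0], [0, 2, 3], [0, 1, 4]]
train_labels = ["sports", "sports", "tech", "tech"]
print(knn_predict([2, 0, 1], train_vectors, train_labels, 3, cosine_similarity))
```

Any other measure with the same two-vector signature can be passed in place of `cosine_similarity`; a distance would need its ranking reversed so that the smallest values come first.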

K-means clustering algorithm

Generally speaking, the clustering of a huge text dataset can be performed effectively using partitional clustering algorithms. One of the most popular partitional clustering algorithms is the K-means algorithm. It is widely known in the literature to be the best-fit approach for handling huge volumes of data [8, 32]. Like any clustering algorithm, K-means leverages a similarity measure that finds the similarity between each document and the representative document of the cluster (head of the cluster). The similarity measure is the core of the clustering process, by which the clustering algorithm's performance is analyzed. However, the most suitable similarity measure to effectively perform clustering is still an open-ended challenge. In our work, for the clustering performance analysis, we ran K-means for each similarity measure, and the values of the evaluation metrics (external metrics including purity, completeness and Rand index, and internal metrics including the Calinski-Harabasz index and the Davies-Bouldin index) were recorded accordingly. We used a voting technique to determine the similarity measure that best fits the K-means algorithm. The voting technique works by counting, for each similarity measure, the number of metrics on which it achieved the best value. The similarity measure with the largest count is the best fit. According to the experimental results of the clustering process, our proposed measure (STB-SM) has been seen as the best fit in most cases. It has achieved (11) out of the (20) points by being the best in four metrics out of five. Unfortunately, in the K-means algorithm, choosing the number of clusters is still an ill-posed problem, as stated in [32, 33]. Therefore, in this study, we have picked (4 and 8) as the numbers of clusters just to analyze and emphasize the behavior of all the similarity measures.
It is worth noting that we are not arguing that (K = 4 or K = 8) is the optimal or best value for the number of clusters. It is just chosen as the number of actual classes in each dataset [34] to draw the performance analysis of K-means using the considered similarity measures. In follow-up work, we plan to further examine the performance with several numbers of clusters and, at the same time, with other clustering algorithms, such as hierarchical clustering algorithms.
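The voting technique described above can be sketched as follows. Each measure receives one vote for every metric on which it achieves the best value, where "best" means highest for purity, completeness, the Rand index and the Calinski-Harabasz index, and lowest for the Davies-Bouldin index. The function name, the rule that tied measures each receive a vote, the measure names, and all numbers are placeholders invented for illustration; they are not results from the paper.

```python
from typing import Dict

# True when a larger value is better for the metric.
HIGHER_IS_BETTER = {
    "purity": True,
    "completeness": True,
    "rand_index": True,
    "calinski_harabasz": True,
    "davies_bouldin": False,  # lower values indicate better-separated clusters
}


def vote_best_measure(scores: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Count, per similarity measure, the metrics on which it is the best."""
    votes = {measure: 0 for measure in scores}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        best = pick(scores[m][metric] for m in scores)
        for m in scores:
            if scores[m][metric] == best:
                votes[m] += 1  # assumption: tied measures each get a vote
    return votes


# Placeholder values only, to show the mechanics of the vote.
scores = {
    "measure_A": {"purity": 0.70, "completeness": 0.60, "rand_index": 0.80,
                  "calinski_harabasz": 120.0, "davies_bouldin": 1.4},
    "measure_B": {"purity": 0.65, "completeness": 0.66, "rand_index": 0.75,
                  "calinski_harabasz": 90.0, "davies_bouldin": 1.9},
}
print(vote_best_measure(scores))  # {'measure_A': 4, 'measure_B': 1}
```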

Machine description

Table 1 displays the machine and environment descriptions used to perform this work.

Table 1 Machine and environment description
