 Research
 Open Access
A set theory based similarity measure for text clustering and classification
Journal of Big Data volume 7, Article number: 74 (2020)
Abstract
Similarity measures have long been utilized in information retrieval and machine learning for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure had been recorded to be highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study provides a comprehensive scrutinization of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation has been made on two of the most popular datasets, namely, Reuters-21 and WebKB. The obtained results confirm that the proposed set theory-based similarity measure (STBSM), as a preeminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
Introduction
In information retrieval and machine learning, a good number of techniques utilize similarity/distance measures to perform many different tasks [1]. Clustering and classification are the most widely used techniques for the task of knowledge discovery within the scientific fields [2,3,4,5,6,7,8,9,10]. Likewise, text classification and clustering have long been vital research areas of information retrieval (IR). Text classification is the process of assigning a text/document to its actual class by utilizing a similarity measure and a proper classifier, whereas clustering is the process of grouping similar texts into groups called clusters. As a matter of fact, with the ever-growing amount of data and information on the internet, the need for a highly effective classification algorithm is urgent. Nevertheless, enhancing classification performance is still a main task for researchers in the text mining field. Given that similarity/distance measures are the core component of classification and clustering algorithms, their efficiency and effectiveness directly impact the techniques' performance in one way or another. Therefore, the selection of the best similarity measure for the techniques in question is still an open-ended, challenging task.
Even though several works in the IR literature have compared similarity/distance measures for clustering and classification purposes [2, 3, 11,12,13,14,15,16], these studies are still incapable of providing a comprehensive preview of the actual performance of similarity measures. Besides, some of those works have presented an efficient similarity measure while ignoring effectiveness [17, 21], while others have presented only an effective similarity measure while ignoring efficiency [2,3,4]. Consequently, this work covers this critical limitation by introducing a compromise (effective and time-efficient) similarity measure, while the most widely used similarity measures are thoroughly investigated under numerous circumstances. Using the K-nearest neighbor classifier (KNN), the K-means clustering algorithm, and the bag-of-words (BoW) representation model [17,18,19] for feature selection, the similarity measures are examined in detail. The K value (in KNN) is varied from 1 to 120, and the number of features is set to 50, 100, 300, 1000, 3000, 6000, and the whole number of features of the considered dataset. In doing so, the superiority of the STBSM measure is emphasized, and each measure is tested under several circumstances so that the desired effectiveness, including accuracy, is obtained at certain K values on several feature counts. These measures are evaluated in low-dimensional settings (by studying their performance on 50, 100, 200, and 350 features) and high-dimensional settings (by studying their performance on 3000, 6000, and all features of the dataset). The measures' behavior has been analyzed to determine which measure gives the best results at certain K values on a specific number of features. Furthermore, for the clustering performance analysis, five evaluation metrics were employed, two of them internal and three external.
The key objective of this work is to present a new competitive measure and to compare and benchmark the similarity measures' performance on the targeted datasets, in both low- and high-dimensional settings. Briefly, the main contributions of this work are listed below:

1.
Introducing a novel similarity measure for text retrieval that behaves based on the set theory mechanism. This measure has been named the set theory-based similarity measure for text retrieval (STBSM). In accordance with the experimental results of both classification and clustering, STBSM has been shown to be a promising measure, superior to the existing state-of-the-art measures.

2.
Along with proposing the STBSM, seven similarity measures that are commonly applied for text retrieval and machine learning purposes are thoroughly investigated and evaluated to benchmark their impact on text retrieval. They are comprehensively tested on two of the most popular publicly available datasets (namely, WebKB and Reuters-21). Using BoW, a thorough comparative analysis of these measures, in terms of their effectiveness and efficiency, is drawn. The classification effectiveness includes six evaluation factors, namely accuracy, precision (PRE), recall (REC), F-measure (FM), G-measure (GM), and average mean precision (AMP). The clustering effectiveness includes five evaluation metrics, namely Purity, Completeness, and Rand index as the external metrics, along with the Calinski-Harabasz index and Davies-Bouldin index as the internal metrics. Moreover, for both classification and clustering efficiency, the run time taken by each measure to find the similarity degree is rigorously observed.

3.
The scope of this work concentrates on promoting the performance of text clustering and classification through a new measure, along with a detailed comparative analysis of the proposed measure against the state-of-the-art BoW-based similarity measures. The drawn analyses provide an influential guide for the selection of similarity measures with respect to the considered datasets, as well as helping researchers fully understand the present and future challenges linked with text retrieval.
The rest of this paper is structured as follows: the most relevant similarity measures for this study are concisely presented in Sect. “Related work”. Section “The set theory” briefly describes the basics and definitions of set theory in the context of text retrieval. Section “The proposed similarity measure (STBSM)” defines, formulates, and analyzes the proposed similarity measure in the context of the set theory. The experimental setup is drawn in Sect. “Experimental setup”. The results of the work are given in Sect. “Experimental results”. The discussion is profoundly detailed in Sect. “Discussion”. Finally, conclusions and future work recommendations are presented in Sect. “Conclusions and future work”.
Related work
The Vector Space Model (VSM) has long been used to represent documents when dealing with text retrieval. In VSM, each document is drawn as an N-dimensional vector, where each dimension represents a vocabulary term/feature. In the information retrieval (IR) literature, there are a good number of similarity measures to compute the pairwise document similarity using VSM. While some works in the IR literature have performed clustering along with classification using similarity/distance measures [2,3,4, 11,12,13,14,15,16], these works lack a comprehensive preview of the actual performance of similarity measures. Moreover, some of them have proposed efficient similarity measures irrespective of their effectiveness [21, 22]. Other works, however, have presented only effective similarity measures without consideration of their efficiency [2,3,4].
Euclidean and Manhattan distances are among the most famous geometric measures utilized to find the distance between each vector pair [2, 20]. Similarly, Cosine similarity finds the similarity between each document pair using the angle between their vectors [10]. The triangle distance is likewise viewed as the Cosine of a triangle between a vector pair [10]; the value of this measure ranges between 0 and 2. On the other hand, for 0–1 vectors, the Hamming distance [4] gives the number of positions at which the feature weights are not equal. The Kullback–Leibler divergence (KLD) [23, 24], a non-symmetric measure, was used in [24] to compute the similarity between each vector pair using the probability distributions associated with both vectors. In [4], a similarity measure for text processing, named SMTP, was proposed to calculate the similarity between a document pair. An information-theoretic measure (IT-Sim) was proposed based on information theory in [18] for document similarity purposes. In [3], a new similarity measure called Improved Sqrt-Cosine (ISC) was proposed. Meanwhile, the Bhattacharyya coefficient was introduced in [21] to approximately calculate the overlap rate between each statistical sample pair. The Jaccard coefficient was developed in [25] to find similarity using the ratio of the number of features existing in both documents to the number of features existing in at least one of them. Subsequently, in [2], a new similarity measure named the pairwise document similarity measure based on present term set (PDSM) was presented, based on the feature weights as well as the number of features that exist in at least one of the considered documents.
Some of these measures have been shown to be highly effective, such as the PDSM [2], the ISC [3], and the SMTP [4], yet unfortunately time-inefficient. In contrast, some measures are not effective yet highly efficient, notably the Euclidean and Manhattan distances. Cosine, on the other hand, has been seen as a compromise: an effective and highly efficient measure. Furthermore, as reported in the IR literature, almost all of these measures were tested in the context of text classification and clustering. For example, PDSM was compared in [2] with five similarity measures in terms of classification and a near-duplicate application. Likewise, ISC [3] and SMTP [4] were evaluated against several similarity measures concerning text classification and clustering. Similarly, the measure proposed in this work has been evaluated against some of the most widely used similarity measures in the machine learning and information retrieval literature, particularly with respect to text classification and clustering. Finally, [7] assessed the clustering performance of several measures on three collections of web documents; the experimental results revealed that Cosine similarity outperforms both the Jaccard coefficient and the Euclidean distance.
The most relevant similarity measures
In this subsection, the similarity measures considered in this study are presented. Seven similarity measures are introduced as the most widely used measures for text clustering and classification [2, 20,21,22,23,24]. These similarity measures work by considering the terms' presence and absence, by evaluating the angle between each vector pair, or by finding the distance. Assuming that we have two documents doc1 and doc2 with two vectors d1 and d2, the aim is to find how similar they are when using the intended similarity measure, as follows:
Euclidean distance (ED)
Every document is drawn as a point in N-dimensional space, where the term frequencies of N terms give the N dimensions. ED finds the distance between each point pair in N-dimensional space using their coordinates, based on the following equation:
Manhattan
The Manhattan distance (also known as the L1 norm) finds the sum of absolute differences between the corresponding coordinates of each document pair's vectors, as follows:
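As an illustrative sketch (the function names are ours, not from the paper), both distances can be computed over BoW weight vectors as follows:

```python
import math

def euclidean(d1, d2):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def manhattan(d1, d2):
    # Sum of absolute coordinate differences (L1 norm).
    return sum(abs(a - b) for a, b in zip(d1, d2))
```

Note that both are distance (dissimilarity) measures: smaller values mean more similar documents.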
Cosine similarity measure
The Cosine similarity calculates the pairwise similarity between the document pairs using the dot product and the magnitude of both vectors of both documents. It is mostly utilized within the scientific fields including the IR field [20], and is defined as follows:
The product of the two vector magnitudes is used to normalize the inner product.
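A minimal Python sketch of the Cosine similarity (the function name is ours); for the vectors (1, 1, 3) and (1, 0, 2) used later in Example 3, it yields approximately 0.94:

```python
import math

def cosine(d1, d2):
    # Inner product normalized by the product of the two vector magnitudes.
    dot = sum(a * b for a, b in zip(d1, d2))
    norms = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norms if norms else 0.0
```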
Jaccard similarity measure
This coefficient was invented in [25]; it divides the size of the intersection of the two sets by the size of their union, and its value ranges between 0 (there is no similarity between the documents) and 1 (both documents are identical). The Jaccard similarity is given by the next equation:
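In its set form, the coefficient can be sketched as below (a hypothetical helper of ours; the weighted variants used in some of the cited works differ):

```python
def jaccard(terms1, terms2):
    # Ratio of shared terms to terms present in at least one document.
    s1, s2 = set(terms1), set(terms2)
    if not (s1 | s2):
        return 1.0  # two empty documents are treated as identical
    return len(s1 & s2) / len(s1 | s2)
```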
Bhattacharya coefficient
The Bhattacharyya coefficient is used to approximately calculate the overlap rate between each statistical sample pair [21]. In our work, however, these samples are treated as documents, and the coefficient is utilized to find the approximate closeness of each document pair.
Kullback–Leibler divergence
It is also known as the "relative entropy" [23, 24]. It is used to measure the difference between probability distributions. Simply, when this measure reaches 0, it signals that the two distributions are identical. Its equation is drawn as follows:
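A minimal sketch of the divergence over two term-probability distributions (the smoothing constant `eps` is our own addition, to guard against zero probabilities):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); 0 iff the distributions match.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Note the asymmetry: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why KLD is called a non-symmetric measure.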
PDSM
This measure was introduced in [2] to tackle a limitation of the state-of-the-art measures by taking the number of present terms into account. PDSM was seen to be effective according to the experimental results of [2] as well as the experimental results of our current work. The PDSM equation is formulated as follows:
where
where \(PF(doc_{i1} doc_{i2} )\) represents the number of present terms, \(AF(doc_{i1} doc_{i2} )\) represents the number of absent terms, and M is the total number of documents.
The set theory
Before introducing the proposed measure, some basics and definitions of set theory in the context of text retrieval, upon which our measure behaves, should be introduced. That is the main objective of this section.
Generally speaking, set theory is a vital component of modern mathematics and is widely used in formal descriptions. A set can be a collection, a group, or even a cluster of points that are named members of that set. For instance, a set of documents is a collection of documents, a set of people is a group of people, etc. For each point to be a member of a set, its membership must be defined clearly. Sometimes, due to a lack of information, defining membership is difficult and may even be vague; if the membership definition is vague for some collection, the collection cannot be called a set. Simply put, if there is a set S with two members X and Y, then it must be determinable whether or not X = Y. Strictly speaking, a set can be finite, infinite, or empty. In the following, some basic definitions and key operations are introduced to further explain the basics upon which the STBSM measure behaves.
Definition 1
If we have two sets S1 and S2, both sets are equal if and only if they have the same points, i.e., every \(X \in S1 \Leftrightarrow X \in S2.\) For example, in the context of text retrieval, if we have Doc1{Ali, Jun, Sarah} and Doc2{Jun, Sarah, Ali}, then we can say that Doc1 = Doc2, and they are identical, as every word that belongs to Doc1 also belongs to Doc2.
Definition 2
If we have two sets S1 and S2, S1 is a subset of S2 (S1 \(\subseteq\) S2) if every X \(\in\) S1 is also in S2; it is a proper subset if, additionally, S1 ≠ S2. For example, in the context of text retrieval, if we have Doc1{Ali, Hassan, Sarah} and Doc2{Hassan, Sarah, Ali, Mark, Farah}, then Doc1 \(\subseteq\) Doc2, and Doc1 is a proper subset of Doc2, as every word that belongs to Doc1 also belongs to Doc2 while Doc2 contains additional words.
Definition 3
The document doc is a collection of terms, that is, any subset of C (including C itself), where C is the set of all terms in the document collection.
Let doc be a document, i.e., a subset of C. We say that doc exists as a vector if the terms of doc exist in the document itself. First, let us define the key relationships between each document pair doc1 and doc2 in the collection C, as follows:
So, for the given document pair doc1 and doc2, the following operations hold:
Operation 1—union
The union of two sets S1 and S2 (S1 \(\cup\) S2) is the set that contains all the elements of both sets S1 and S2, without duplication.
In the context of text retrieval, the Union operation of doc1 and doc2, doc1 \(\cup\) doc2, is the group of terms {t_{1},…, t_{n}} where n is the number of addressed terms in both documents, that are involved in either doc1, doc2 or both:
Operation 2—intersection
The Intersection of two sets S1 and S2 (S1 \(\cap\) S2), is the set that contains shared elements of sets S1 and S2.
In the context of text retrieval, the Intersection operation of doc1 and doc2, doc1 \(\cap\) doc2, is the group of terms {t_{1},…, t_{n}} where n is the number of addressed terms in both documents, that are involved in both documents doc1 and doc2 at the same time:
Operation 3—negation
The negation (set difference) operation, doc1\doc2 or doc2\doc1, is the group of terms that belong to one document but not the other:
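The three operations map directly onto Python's built-in set type; a small sketch over two hypothetical term sets:

```python
doc1 = {"ali", "hassan", "sarah"}
doc2 = {"hassan", "sarah", "mark"}

union = doc1 | doc2         # terms in either document (no duplicates)
intersection = doc1 & doc2  # terms shared by both documents
only1 = doc1 - doc2         # doc1 \ doc2: terms of doc1 absent from doc2
only2 = doc2 - doc1         # doc2 \ doc1: terms of doc2 absent from doc1
```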
The proposed similarity measure (STBSM)
The formulation of STBSM similarity measure
Suppose we have a document pair doc1 and doc2. Let doc1 = (w_{11}, w_{12},…) and doc2 = (w_{21}, w_{22},…) be the weighting vectors (using the BoW model) of the term sets for document 1 and document 2, respectively. Let T_{1} {t_{11}, t_{12},… t_{1n}} and T_{2} {t_{21}, t_{22},… t_{2n}} be the sets of terms contained by doc1 and doc2, respectively. For the sake of simplicity, the proposed STBSM equations are as follows:
where the notations "∩" and "\" denote the intersection and complement operators of set theory, and W_{ij} is the weighting value. To further explain the mechanism of this measure and briefly clarify some deficits of the state-of-the-art measures, we provide three examples as follows:
Example 1
Assuming we have doc1 (2, 5, 7, 8, 0, 9) and doc2 (9, 0, 0, 6, 5, 1), then STBSM works as follows (for simplicity, X is x1 and x2; Y is y1 and y2; Z is z1 and z2; T_{i}.w denotes the weight of term i):
T1.w  T2.w  T3.w  T4.w  T5.w  T6.w  
Doc1  2  5  7  8  0  9 
Doc2  9  0  0  6  5  1 
X1 = 2 + 8 + 9 = 19; X2 = 9 + 6 + 1 = 16; Z1 = 2 + 5 + 7 + 8 + 9 = 31; Z2 = 9 + 6 + 5 + 1 = 21; Y1 = 5 + 7 = 12; Y2 = 5
While STBSM yielded (0.47 × 0.91 = 0.43), Cosine and Jaccard yielded 0.42 and 0.22, respectively.
Example 2
Assuming we have doc1 (0, 2, 1, 1, 0, 1) and doc2 (3, 1, 1, 1, 1, 0), then STBSM works as follows:
T1.w  T2.w  T3.w  T4.w  T5.w  T6.w  
Doc1  0  2  1  1  0  1 
Doc2  3  1  1  1  1  0 
X1 = 4; X2 = 3; Z1 = 5; Z2 = 7; Y1 = 1; Y2 = 4
While STBSM yielded (0.34 × 0.89 = 0.30), Cosine and Jaccard yielded 0.42 and 0.50, respectively.
Example 3
Assuming we have doc1 (1, 1, 3) and doc2 (1, 0, 2), then STBSM works as follows:
T1.w  T2.w  T3.w  
Doc1  1  1  3 
Doc2  1  0  2 
X1 = 4; X2 = 3; Z1 = 5; Z2 = 3; Y1 = 1; Y2 = 0
While STBSM yielded 0.80, Cosine and Jaccard yielded 0.94 and 0.25, respectively.
As seen from the examples above, Cosine occasionally finds a good similarity, as indicated in example 1. However, Cosine gives the same value for examples 1 and 2 despite the clear difference between the vector pairs, and, to further exacerbate the issue, its similarity value is highly exaggerated in example 3. It is worth indicating that one novelty of the STBSM measure is that its similarity value is never exaggerated in the way shown for Cosine, or other state-of-the-art measures, in example 3. The STBSM measure enables non-zero/non-shared features to have an explicit contribution to the similarity computation. Therefore, STBSM effectively takes the presence and absence of all features into consideration.
On the other hand, Jaccard occasionally produces a good similarity, as shown in example 2, but more frequently the Jaccard similarity is poor, as indicated in examples 1 and 3. Our proposed measure therefore offers a compromise in which the desired behaviour is captured: examples 1 and 3 show a better and more accurate similarity found by STBSM in comparison with Cosine and Jaccard.
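Since the published equation is not reproduced above, the following Python sketch infers the measure from the worked examples: X is the summed weight of the shared terms, Z the summed weight of all present terms, Y = Z − X the summed weight of the unshared terms, and STBSM = (X1·X2)/(Z1·Z2) · (1 − (Y1·Y2)/(Z1·Z2)). This reconstruction reproduces examples 1–3 and the worst- and best-case values, but it is our inference from those examples, not the authors' code:

```python
def stbsm(d1, d2):
    # Positions of features present (non-zero) in both documents.
    shared = [i for i in range(len(d1)) if d1[i] and d2[i]]
    x1 = sum(d1[i] for i in shared)   # shared-term weight mass in doc1
    x2 = sum(d2[i] for i in shared)   # shared-term weight mass in doc2
    z1, z2 = sum(d1), sum(d2)         # total present-term weight mass
    y1, y2 = z1 - x1, z2 - x2         # unshared-term weight mass
    if z1 == 0 or z2 == 0:
        return 0.0
    # Overlap factor, penalized by the mass of non-shared features.
    return (x1 * x2) / (z1 * z2) * (1 - (y1 * y2) / (z1 * z2))
```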
STBSM analysis
In this subsection, we concisely and informatively analyze the cases of the proposed measure as follows:
The worstcase:
This case occurs when there is not even one shared feature between the document vectors.
Example (worst case): Assuming we have doc1 (3, 0, 1) and doc2 (0, 2, 0), we find that X1 = 0, X2 = 0, Z1 = 4, Z2 = 2, Y1 = 4, Y2 = 2. Because X1 = X2 = 0, STBSM = 0. The same holds for the documents (1, 0, 1) and (0, 1, 0), which is logically correct since no shared feature exists.
The average case:
This occurs when there is at least one shared feature, as in the examples drawn above (1–3). In this case, STBSM has a value in the range [0, 1].
The best case:
This occurs when both vectors are completely equivalent.
Example (best case): Assuming we have doc1 (4, 4, 4) and doc2 (4, 4, 4), or doc1 (1, 1, 1) and doc2 (1, 1, 1), we find that X1 = Z1, X2 = Z2, and Y1 = Y2 = 0. Accordingly, STBSM = 1, which is logically true as both documents are identical.
The properties of similarity measures
According to [2, 4], there are six vital properties every similarity measure should have to be considered an optimal measure. These properties are listed below:
Property 1:
The existence or non-existence of the intended feature is more vital than the difference between the values linked with the existing feature. According to the examples calculated above, STBSM explicitly takes the presence and absence of features into consideration.
Property 2:
The similarity value should grow as the difference between the values of non-zero features declines. For instance, let f1 and f2 be two features belonging to doc1 and doc2, respectively. Then, for doc1 and doc2, the similarity between f1 = 12 and f2 = 6 is higher than the similarity between f1 = 20 and f2 = 6. This property is also clearly shown in example 3, along with the worst-case example.
Property 3:
The similarity value should be reduced as the number of features present in one document but absent in the other rises. This was showcased in both the worst- and best-case examples, clearly indicating the applicability of this property.
Property 4:
A pair of documents has low similarity if there are many non-zero-valued features in one document corresponding to many zero-valued features in the other. For instance, if we have two vectors doc1(f1, f2) = (1, 0) and doc2(f3, f4) = (1, 1), then doc1.f2 and doc2.f4 are the key cause for lowering the similarity between both documents, as f2 × f4 = 0 and, at the same time, f2 + f4 > 0. Example 2 supports the applicability of this property.
Property 5:
The similarity measure should be symmetric. For instance, the similarity between doc1 (1, 1, 0) and doc2 (1, 1, 1) must be the same when doc2 (1, 1, 1) and doc1 (1, 1, 0) are considered. According to the examples drawn above, STBSM enjoys this property completely.
Property 6:
The distribution of values should contribute to the similarity between every two documents. That means features with a higher spread (standard deviation) contribute more to the similarity than those with a lower spread.
Experimental setup
Text preprocessing
Some standard operations were carried out to transform the text into vectors for processing. The text was converted to lower case; numbers, punctuation, stop words (common words), and extra white space were removed; and some particular symbols (such as $, %) were converted into spaces.
Text representation
The bag of words (BoW) model [26, 27] was used to represent documents that were in the vector space model (VSM). The BoW model represents each document as a word collection disregarding the grammar and word order [28].
Given that we used Python to run the text preprocessing, it was performed using the NLTK (Natural Language Toolkit) library as follows:

Tokenization: using the NLTK word tokenizer

Converting all the words to lower case: using the lower() Python string function

Lemmatizing: using the NLTK WordNetLemmatizer

Stopword removal: using the NLTK stopwords corpus

Keeping only words with 4 or more letters
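The paper performs these steps with NLTK; the stdlib-only sketch below mirrors the pipeline (lemmatization omitted, and the stop word set is a tiny illustrative sample of ours, not NLTK's list):

```python
import re

STOPWORDS = {"the", "is", "are", "and", "with", "from", "this", "that"}  # sample only

def preprocess(text):
    # Turn non-letter characters (digits, punctuation, symbols) into spaces,
    # lower-case the text, and tokenize on whitespace.
    tokens = re.sub(r"[^a-zA-Z]+", " ", text).lower().split()
    # Drop stop words and keep only words with 4 or more letters.
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 4]
```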
The comparison mechanism of classification
After preprocessing, all of the documents were represented using the BoW model in VSM so that the classification process could start smoothly. Following that, the performance of every similarity measure across the different kinds of documents was compared and evaluated. Six evaluation criteria were used, namely accuracy, precision, recall, F-measure, G-measure, and average mean precision. For each criterion, the KNN algorithm was run from K = 1 to K = 120 over each number of features of each dataset, and the averaged results were accumulated and drawn as given in the tables below (5, 6, 7, 8, 9, 10, 11, 12, 13). The number of features (NF) was varied over NF = 10, 50, 100, 200, 350, 3000, 6000, and the whole number of features (see Appendix samples). In consequence, we have eight runs of the KNN algorithm over two datasets to test six criteria using eight similarity measures. The final number of implementations performed to obtain the results below was 8 × 2 × 6 × 8 = 768 runs. If we also consider the sixty (60) values of K tested in each KNN cycle, the total number of runs is 46,080.
Term weighting
We adopted the most widely used Term Frequency (TF) technique of weighting which simply gives the occurrence of each word in the relative document [29, 30].
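A minimal TF-weighting sketch over a fixed BoW vocabulary (the names are ours):

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    # Raw term frequency: how often each vocabulary word occurs in the document.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]
```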
Knearest neighbor classifier
The K-nearest neighbor algorithm (KNN) is among the most widely used algorithms in the IR literature for document classification. Although it is a lazy algorithm [27], it is non-parametric, simple, and believed to be amongst the top ten algorithms in data mining [31]. It works by selecting the nearest points to the point in question. The concept of KNN is that points in the same class are highly likely to be close to one another according to the used similarity measure. KNN assumes the following: (1) points in the feature space have a specific distance between each other, and that distance is used as a metric to gauge closeness; (2) each point in the training set has its vector and class label. A certain number "k" is then determined to draw the neighboring area of the point in question.
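The procedure can be sketched as below, with the similarity measure passed in as a parameter (a simplified majority vote; the names are ours, not the paper's implementation):

```python
from collections import Counter

def knn_classify(query, training, k, similarity):
    # training: list of (vector, label) pairs. Rank neighbors by similarity
    # to the query and take the majority label among the top k.
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Any of the measures above (Cosine, Jaccard, STBSM, …) can be supplied as `similarity`; a distance measure would instead be ranked in ascending order.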
Kmeans clustering algorithm
Generally speaking, the clustering of a huge text dataset can be performed efficiently using partitional clustering algorithms. One of the most popular partitional clustering algorithms is the K-means algorithm, widely known in the literature as a best-fit approach for handling huge volumes of data [8, 32]. Like any clustering algorithm, K-means leverages a similarity measure that finds the similarity between each document and the representative document of the cluster (the head of the cluster). The similarity measure represents the core of the clustering process, by which clustering algorithm performance is analyzed; however, the most suitable similarity measure to effectively perform clustering is still an open-ended challenge. In our work, for the clustering performance analysis, we ran K-means for each similarity measure, and the values of the evaluation metrics (external metrics including purity, completeness, and Rand index, and internal metrics including the Calinski-Harabasz index and Davies-Bouldin index) were drawn accordingly. We used a voting technique to determine the similarity measure that best fits the K-means algorithm: it counts, for each similarity measure, the number of metrics on which that measure achieves the best value, and the measure with the biggest count is considered the best fit. According to the experimental results of the clustering process, our proposed measure (STBSM) was the best fit in most cases, achieving 11 out of the 20 points by being the best in four metrics out of five. Unfortunately, in the K-means algorithm, the number of clusters is still an ill-posed problem, as stated in [32, 33]. Therefore, in this study, we picked 4 and 8 as the numbers of clusters simply to analyze and emphasize the behavior of all the similarity measures.
It is worth noting that we are not arguing that K = 4 or K = 8 is the optimal number of clusters; it is simply chosen as the number of actual classes in each dataset [34] to draw the performance analysis of K-means using the considered similarity measures. In follow-up work, we plan to further examine the performance analysis with several numbers of clusters K and, at the same time, with other clustering algorithms, such as hierarchical clustering algorithms.
Machine description
Table 1 displays the machine and environment descriptions used to perform this work.
Dataset description
Reuters dataset (Table 2): the Reuters-R8 dataset holds the eight most frequent classes of the original ninety classes in the Reuters dataset. After applying preprocessing, a total of 18,308 features were extracted.
WebKB dataset (Table 3): it consists of web pages of the computer science departments of the following universities: Cornell, Texas, Washington, and Wisconsin. It was obtained from the World Wide Knowledge Base project of the CMU text learning group. After applying the preprocessing, a total of 33,025 features were extracted. The data in both datasets were divided into training and testing sets in a 2:1 ratio (67%:33%). To overcome overfitting or underfitting, instead of dividing the whole data randomly into training and testing data, each class was divided individually and then combined into training and testing data. Both datasets are loaded directly from Python, as they are already integrated with it.
The classification evaluation criteria
This subsection holds the evaluation criteria used for classification, as follows:
Accuracy (ACC)
ACC measures the proportion of samples that are correctly classified out of the whole sample collection. ACC is defined in the next equation.
Precision (PRE)
PRE checks the total number of items that are correctly identified as positive out of the total items identified as positive.
Recall (REC)
REC checks the total number of items that are correctly identified as positive out of the actual positive.
Fmeasure or FScore (FM)
FM is the harmonic mean of precision and recall. It is useful when classes are not distributed evenly.
G-measure or G-score (GM)
GM is the geometric mean of precision and recall. It is also used when classes are not distributed evenly.
Average mean precision (AMP)
AMP is the mean of the average precision (AP) over all classes; it evaluates how precisely the classifier is performing. For a single class, AP = Σ_{n} (R_{n} − R_{n−1}) P_{n}, where P_{n} and R_{n} are the precision and recall at the nth threshold. Finally, True Positive, True Negative, False Positive and False Negative are defined as follows:
True positive (TP): the number of class-1 testing documents that are correctly identified as class 1.
True negative (TN): the number of instances of class 2, class 3, …, class N correctly identified as class 2, class 3, …, class N respectively.
False positive (FP): the number of class-2, class-3, …, class-N testing documents that are incorrectly identified as class 1.
False negative (FN): the number of class-1 testing documents that are incorrectly identified as class 2, class 3, …, class N.
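From these four counts, the threshold-free criteria above follow directly. A minimal sketch in Python (our own helper, not the paper's code):

```python
import math

def classification_scores(tp, tn, fp, fn):
    """Compute ACC, PRE, REC, FM (harmonic mean of PRE and REC) and
    GM (geometric mean of PRE and REC) from the four confusion counts
    for one class treated as positive."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    fm = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    gm = math.sqrt(pre * rec)
    return {"ACC": acc, "PRE": pre, "REC": rec, "FM": fm, "GM": gm}
```

In the multi-class setting, these scores are computed per class in a one-vs-rest fashion and then averaged.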
The clustering evaluation criteria
This subsection presents the evaluation criteria used for clustering. While the external metrics require the actual labels to assess cluster quality (see Eqs. 18, 19, 20), the internal metrics do not (see Eqs. 21, 22).
Accuracy (also known as purity)
It is used to check the extent to which a cluster is pure; ideally, every cluster contains only one class and different clusters contain different classes. In other words, this metric evaluates the coherence of a cluster. It is defined by the following equation:
Purity = (1/N) Σ_{i=1}^{k} max_{j} |c_{i} ∩ t_{j}|
where N is the number of objects (data points), k is the number of clusters, c_{i} is a cluster in C, and t_{j} is the class which has the maximum count for cluster c_{i}.
Completeness
It checks whether all members of a given class are assigned to the same cluster.
where
Rand index
It is used to check how many pairwise cluster assignments are correctly predicted: RI = (a + b)/(n(n − 1)/2)
where n is the total number of samples, and (a + b) is the number of sample pairs on which the real and the assigned cluster labels agree.
Calinski–Harabasz index
It is used to measure the ratio of between-cluster dispersion to within-cluster dispersion.
where
where C_{q} is the set of points in cluster q, c_{q} is the center of cluster q, c_{E} is the center of E, and n_{q} is the number of points in cluster q.
Davies–Bouldin index
This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves.
where
where s_{i} is the average distance between each point of cluster i and the centroid of that cluster, and d_{ij} is the distance between cluster centroids i and j. Finally, the best and worst values and the range of each measure are given in Table 4.
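As a minimal sketch of two of the external metrics above (our own helper functions, not the paper's code), purity and the Rand index can be computed as follows; the remaining metrics (completeness, Calinski–Harabasz, Davies–Bouldin) are available in scikit-learn as `completeness_score`, `calinski_harabasz_score`, and `davies_bouldin_score`:

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Purity: each cluster contributes the count of its majority class;
    the total is divided by the number of points N."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(true_labels)

def rand_index(true_labels, pred_labels):
    """Rand index: fraction of point pairs on which the true and the
    predicted labelings agree (same-same or different-different)."""
    n = len(true_labels)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            agree += same_true == same_pred
    return agree / (n * (n - 1) / 2)
```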
Experimental results
Classification results
This work investigated all considered measures comprehensively based on six performance-evaluation criteria; to the best of our knowledge, it is the first study of its type to do such an investigation in the information retrieval field with respect to text classification. The K value of KNN was varied from 1 to 120 with an increment of 2 in each cycle (see Appendix samples). The number of features (NF) of each dataset was varied (10, 50, 100, 200, 350, 3000, 6000, all features) to clearly draw out the best performance of each measure under several circumstances. Then, for each measure, the results were averaged over all K values at each NF to yield the results given in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13. In other words, the following tables contain the results of each similarity measure averaged at each NF over all K values in the range [1–120], as given in the appendix. In each table, the averaged results of all K values for each performance criterion are displayed. Table 5 displays the averaged results of all criteria when NF = 10. For simplicity, we present the averaged results of all measures while briefly analyzing three criteria, namely, ACC, FM, and AMP.
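The averaging procedure just described can be sketched as follows. This is a simplified, self-contained illustration (the function names are ours, and cosine similarity stands in for any of the compared measures; the actual study sweeps K over [1–120] in steps of 2 on text features):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity with a small epsilon against zero vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_predict(train_X, train_y, x, k, sim=cosine_sim):
    """Classify x by majority vote among its k most similar training vectors."""
    sims = np.array([sim(t, x) for t in train_X])
    top = np.argsort(-sims)[:k]
    labels, counts = np.unique(train_y[top], return_counts=True)
    return labels[np.argmax(counts)]

def average_accuracy_over_k(train_X, train_y, test_X, test_y, k_values):
    """Average the accuracy over all K values, as done per NF in Tables 5-13."""
    accs = []
    for k in k_values:
        preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
        accs.append(np.mean(np.array(preds) == test_y))
    return float(np.mean(accs))
```

In the study's setting, `k_values` would be `range(1, 121, 2)` and the loop would be repeated for each NF and each similarity measure.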
As shown in Table 5, for the Reuters dataset, Euclidean, followed by STBSM and Cosine, achieved the highest accuracy. However, STBSM, followed by Euclidean and Kullback–Leibler, outperformed all measures on both the FM and AMP criteria. On the WebKB dataset, PDSM, followed by STBSM and Cosine, outperformed all similarity measures in ACC. In regard to FM and AMP, Cosine, followed by STBSM and PDSM, outweighed all measures, with STBSM superior to PDSM on FM and PDSM superior to STBSM on AMP. So, the best measures when NF = 10 were Euclidean, STBSM, and Cosine on Reuters, and PDSM followed by STBSM and Cosine on WebKB.
Tables 6, 7, 8, 9, 10 show that, for both Reuters and WebKB, STBSM, followed by PDSM and Cosine, achieved the highest ACC, FM, and AMP respectively. However, two exceptions are noted: first, when NF = 350, Cosine outweighed PDSM in terms of FM and AMP on both Reuters and WebKB; second, when NF = 3000, Cosine outperformed PDSM in terms of FM and AMP on Reuters only. Nevertheless, the top-performing measures when NF was in the range [50–3000] were STBSM, PDSM, and Cosine.
On Reuters, in Tables 11, 12, similarly to Table 10, STBSM, followed by PDSM and Cosine, was superior with the highest ACC, FM, and AMP respectively; moreover, Cosine outweighed PDSM in terms of FM and AMP. In contrast, on WebKB, PDSM, followed by STBSM and Jaccard, was superior in terms of ACC and AMP; however, Cosine was superior to Jaccard in terms of FM only. So, the top-performing measures when NF was in the range [6000–all features] were STBSM, PDSM, Cosine, and Jaccard. It is worth mentioning that across Tables 6, 7, 8, 9, 10, 11, 12, the results have been almost the same; in other words, the results were notably stable.
Finally, in Table 13, when the average of the averaged results is taken, it is clear that, for both Reuters and WebKB, STBSM, PDSM, and Cosine have been the best measures for all criteria. Thus, in conclusion, the top-performing measures on the averaged results are STBSM, PDSM, and Cosine.
Clustering results
In this subsection, we have evaluated and compared the impact of all considered similarity measures on the behavior of the K-means clustering algorithm. Fixing K at 4 and 8 and using the clustering evaluation metrics presented above (see Table 4), the experiments were conducted on both datasets (Reuters and WebKB) to experimentally identify which measure would be the best fit for K-means. Either positively or negatively, the experiments clearly show the impact of the similarity-measure selection on clustering quality. All features of both datasets were considered while the clustering process was running (Reuters = 18,308 features, WebKB = 33,025 features). As stated earlier, we have used two internal metrics and three external metrics for the clustering evaluation of K-means under the umbrella of all considered similarity measures. As for the stopping condition, K-means was allowed to stop after running 50 iterations, or, alternatively, once the algorithm reached the stability situation, i.e., the case in which the K-means clusters remained stable (unchanged) for two consecutive cycles. The initial centroids of the clusters were chosen randomly in each run. We have used the voting technique (see Table 20) to decide the best-fit similarity measure, i.e., the one with which the performance of K-means was noted to be the highest. According to the results given in Tables 14, 15, 16, 17, 18, STBSM, followed by PDSM and Euclidean, has been the best fit in this study. The bolded values in Tables 14, 15, 16, 17, 18 signify the best value each measure achieved on the corresponding metric.
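The paper does not list its exact K-means variant here, so the following is a sketch under our own assumptions: assignment by maximum similarity, centroid update by the cluster mean, and the stated stopping rule (50 iterations, or assignments unchanged for two consecutive cycles). The `sim` argument stands for any of the compared measures:

```python
import numpy as np

def kmeans_with_similarity(X, k, sim, max_iter=50, seed=0):
    """K-means where each point joins the centroid it is most *similar* to.
    Stops after max_iter iterations, or earlier when the assignments are
    unchanged for two consecutive cycles."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    stable = 0
    for _ in range(max_iter):
        new_labels = np.array([
            int(np.argmax([sim(x, c) for c in centroids])) for x in X
        ])
        stable = stable + 1 if np.array_equal(new_labels, labels) else 0
        labels = new_labels
        if stable >= 2:
            break
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```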
Discussion
The discussion revolves around two key points: first, the stability of each measure's performance over both datasets; second, the number of features at which each measure performed best in terms of accuracy (ACC), F-measure (FM), and average mean precision (AMP).
Classification performance stability
Based on the results given in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, Table 19 presents the most stable measures on both datasets based on the points every measure achieved on each number of features. The more points a measure achieves, the more stable it is. In the next table, R and W indicate the Reuters and WebKB datasets respectively.
Table 19 shows that the most stable measures were STBSM, PDSM, and Cosine with 48, 45, and 45 points respectively. While PDSM was more stable on WebKB than Cosine, Cosine was more stable on Reuters than PDSM. Table 19 identifies the stable measures by giving one point to a measure whenever it was superior in terms of a specific criterion (out of three criteria, namely, ACC, FM, and AMP) on each dataset; the raw numbers in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 also show indisputably that the top-performing measures are STBSM, PDSM, Cosine, and Jaccard. It can also be deduced from the results that, unlike on Reuters, where the measures have unstable performance, all measures on WebKB have almost stable performance, chiefly the top performers. On Reuters, the competition was held between STBSM, PDSM, and Cosine. Moreover, from Table 13, based on the points recorded at each NF, it can be concluded that STBSM, PDSM, Cosine, and Jaccard can be used effectively for low-, middle-, and high-dimensional datasets, as these measures performed well at each NF value. Euclidean and Manhattan also performed well on low-dimensional datasets (NF in [10–200] features). Bhattacharya was observed to behave well on middle- and high-dimensional datasets (NF in [200–N] features).
Strictly speaking, the highest performance over both datasets was seen for both STBSM and PDSM, as both measures showed almost stable and close performance at all values of NF, while the poorest performance was seen for Kullback–Leibler, Manhattan, and Euclidean, chiefly on high-dimensional datasets.
Classification-performance peak
This subsection examines the number of features at which each similarity measure's performance was at its peak in terms of accuracy, F-measure, and average mean precision.
Performance analysis: Reuters
The next figures (Figs. 1, 2, 3) map the movement of the criteria (averaged results) for all measures over several NF values.
Figure 1a depicts that Manhattan, followed by Euclidean and Kullback–Leibler, did not have a stable accuracy and performed the worst as NF grew. In contrast, STBSM, PDSM, and Bhattacharya had the most stable performance. Cosine and Jaccard showed increasing accuracy while NF grew from 10 to 3000, and then declined slightly as NF grew further. Figure 1b shows the top competitors, which were STBSM, PDSM, and Cosine, with STBSM being superior.
Figure 2a shows that Euclidean and Manhattan had a higher FM when NF was in the range [10–100]. However, their performance started deteriorating as NF grew. As with accuracy, STBSM, PDSM, Cosine, Bhattacharya, and Jaccard yielded an almost stable FM while NF grew from 10 to N, with STBSM and PDSM being the best, and Bhattacharya outperforming Jaccard; Bhattacharya's performance declined slightly in favor of STBSM, PDSM, and Cosine, though. Finally, Kullback–Leibler was shown to perform the worst. Figure 2b shows the top competitors, which were STBSM, Cosine, and PDSM, with STBSM superior over all of them and Cosine superior over PDSM.
Finally, from Fig. 3a, it is noted that Euclidean, followed by Manhattan and Kullback–Leibler, had the worst performance in terms of AMP. On the other hand, STBSM, PDSM, Cosine, and Bhattacharya showed the best performance. The Jaccard measure represented a middle ground between the measures with the highest AMP and those with the lowest. It is worth noting that all of the highest-performing measures were effective from NF = 10 to NF = 3000, and their effectiveness kept improving when all features were used. However, as NF surpassed 3000 to reach 6000 or more, their performance lowered slightly, as shown in Fig. 3b.
Performance analysis: WebKB
The next figures (Figs. 4, 5, 6) map the movement of the criteria (averaged results) for all measures over several NF values.
From Fig. 4a, it can be seen that Manhattan, followed by Kullback–Leibler, had an almost stable accuracy, albeit a poor one as NF grew. STBSM, PDSM, Cosine, and Jaccard showed clearly stable, higher accuracy while NF grew from 10 to all features, with STBSM and PDSM being highly superior. While STBSM outweighed PDSM when NF was in the range [50–3000], PDSM outweighed STBSM from 6000 to all features, as shown in Fig. 4b; both measures intersected at 3000 features, though. In addition, on average, STBSM still took the lead. Manhattan and Euclidean performed closely to each other when NF was in the range [10–200]. However, as NF grew, Euclidean outperformed Manhattan, and was closer to Bhattacharya when NF was in [350–33,025].
From Fig. 5a, it can be seen that Manhattan and Kullback–Leibler had the worst performance, albeit Kullback–Leibler was closer to Cosine when NF = 10; it was a rare case, though. Similarly to Fig. 4b, Fig. 5b shows that PDSM, STBSM, and Cosine performed best, with PDSM and STBSM being fierce rivals. On the other hand, Jaccard and Euclidean outperformed Manhattan, Kullback–Leibler, and Bhattacharya when NF was in the range [50–6000]. However, as NF grew bigger, Bhattacharya started to show a gradually increasing advantage over Euclidean.
Finally, from Fig. 6a, it is clear that Manhattan and Bhattacharya, followed by Kullback–Leibler, had the worst performance in terms of AMP, albeit Manhattan had a higher AMP when NF was in [10–100]. When NF was in the range [50–200], Bhattacharya, followed by Kullback–Leibler, had the worst AMP values; as NF grew, though, Bhattacharya started to perform better than Manhattan. Similarly, Euclidean outweighed Jaccard when NF was in the range [10–350]; however, as NF grew, Jaccard behaved better than Euclidean. Similar to Fig. 5b, Fig. 6b shows that PDSM, STBSM, and Cosine had the best performance, with PDSM and STBSM being fierce rivals.
Classification–execution time analysis
Finally, the time consumed by each measure on each dataset at each NF was accumulated and averaged to show which measures ran the fastest and which the slowest. A certain measure could give higher accuracy and the desired performance while running slower than others, and vice versa. The next figures map the time taken by each measure to produce the results. According to the execution times shown in Figs. 7, 8, it is abundantly clear that all measures share one fact: the execution time grows steadily as NF increases, for PDSM in particular. It is worth mentioning that the time was measured while each similarity measure ran on all six classification evaluation metrics.
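The per-measure timing just described can be sketched with `time.perf_counter` (a minimal illustration; `measures` and `run_experiment` are our own hypothetical names, not the paper's code):

```python
import time

def time_measures(measures, run_experiment):
    """Accumulate wall-clock time per similarity measure.
    `run_experiment(sim)` is expected to run the full evaluation
    (all metrics, all K values) using the given similarity function."""
    timings = {}
    for name, sim in measures.items():
        start = time.perf_counter()
        run_experiment(sim)
        timings[name] = time.perf_counter() - start
    return timings
```

Repeating this per NF and averaging yields curves like those in Figs. 7, 8.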
Figure 7 clearly shows that Bhattacharyya and Manhattan were the fastest similarity measures, with Manhattan being much faster when all features were used. However, this speed came at the expense of the results drawn above, as these two measures occupied the second- and third-worst positions after Kullback–Leibler. Euclidean was observed to be the middle ground in terms of speed between the first group (Bhattacharyya, Manhattan) and the second group (PDSM, Kullback–Leibler, Jaccard, Cosine, and STBSM). Surprisingly, when all features were addressed, Manhattan was the fastest measure and PDSM was the slowest, taking roughly 1493.85 min on Reuters when NF = all features. On the other hand, much like Bhattacharyya, Euclidean recorded worse results when compared with Cosine, Jaccard, STBSM, and PDSM. Jaccard, in turn, was faster than Cosine and STBSM, and Cosine was slower than STBSM. In order, PDSM, Jaccard, followed by Cosine and STBSM, were the slower measures compared with the first group, with PDSM being the slowest of all.
Similarly to Fig. 7, Fig. 8 clearly shows that Bhattacharyya and Manhattan were also the fastest similarity measures, with Manhattan being subtly faster when all features were used. However, as on Reuters, this came at the expense of their results, as these two measures occupied the second- and third-worst positions after Kullback–Leibler. Euclidean, on the other hand, was observed to be the fastest measure when all features were considered, and PDSM was the slowest ever, taking almost 1001.067 min on all features of WebKB. However, like Bhattacharyya, Euclidean was recorded to have worse results compared with Cosine, Jaccard, PDSM, and STBSM. Meanwhile, Cosine was faster than Jaccard and STBSM in all NF cases except the case in which all features were addressed; in that case, Jaccard was faster than both Cosine and STBSM. In general, in order, PDSM, Kullback–Leibler, Jaccard, STBSM, and Cosine were the slowest measures, with PDSM being the slowest ever.
Clustering analysis
Based on the results given in Tables 14, 15, 16, 17, 18, the analysis is summarized in Table 20 by counting the points each similarity measure achieved on each metric. A point is counted for a measure if its value is bolded as the highest in Tables 14, 15, 16, 17, 18. The total number of points is 20, as we have two datasets and five metrics over two values of the clustering variable (K = 4, K = 8). In each table, there are four points each measure could achieve based on the given results. For example, Euclidean in Table 14 got 4 points, as its results are the top values for the purity metric on both datasets for both K values (4 and 8). Table 20 gives the points recorded for each measure on each metric (Tables 14, 15, 16, 17, 18), as well as the total points and the rank. The bolded values in Table 20 signify the highest point totals, which reflect the optimality of each measure on the corresponding metric.
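The point counting behind Table 20 can be sketched as follows (a minimal illustration; `winners_per_cell` is our own hypothetical representation of the bolded entries):

```python
from collections import Counter

def tally_points(winners_per_cell):
    """Each cell corresponds to one (dataset, K, metric) combination and
    lists the measures bolded as best there. A measure earns one point per
    cell it wins; with two datasets, five metrics, and K in {4, 8} there
    are 20 cells in total."""
    points = Counter()
    for winners in winners_per_cell:
        points.update(winners)
    return points
```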
In general, STBSM behaves better than PDSM on WebKB clustering, chiefly when K grows. That suggests STBSM could work optimally on big data and enjoys scalability properties, i.e., it holds up as the dataset grows larger and larger. Interestingly, unlike in classification, PDSM works better than STBSM on Reuters, though. Briefly, according to Tables 14, 15, 16, 17, 18, the order of top-performing measures on Reuters was Euclidean, PDSM, Cosine, and STBSM; on the other extreme, the order of top-performing measures on WebKB was STBSM, Cosine, Euclidean, and PDSM. Strictly speaking, the competition is fiercely held between STBSM, Euclidean, Cosine, and PDSM, with STBSM being maximally superior. In other words, according to the raw numbers given in the results (Tables 14, 15, 16, 17, 18), STBSM has better values than all measures in most cases; for example, in purity on WebKB, and in completeness and Rand index on both datasets, the values of STBSM are much bigger than those of the other measures. Thus, it can be confidently said that STBSM outperformed all similarity measures significantly in most cases of the clustering evaluation metrics.
Clustering–execution time analysis
Based on the times given in Tables 21, 22, PDSM was the slowest measure and Manhattan was the fastest. As shown in Tables 14, 15, 16, 17, 18, 21, 22, our proposed STBSM measure came as a compromise solution for both efficiency and effectiveness. It is worth noting that the clustering time was measured while K-means ran on nine evaluation metrics, whereas in this work we used only five of them. So, the times given in Tables 21, 22 would be longer (either slightly or significantly) than the time expected if K-means were run on only these five metrics, because adding each metric to clustering often takes extra time and consequently increases the clustering time either slightly or significantly. Nevertheless, this does not refute or contradict the final conclusion drawn in this paper on the speed or slowness of each similarity measure.
The applicability of the proposed measure (STBSM) in a big data environment
Since the advent of the Internet, the size of textual information has kept growing because of the continuous evolution of information technologies. These technologies have allowed massive volumes of data to increase exponentially across online content such as webpages of all kinds (academic, scientific, news, medical, etc.), blogs, social networks like Facebook and Twitter, and YouTube. On a daily basis, trillions of bytes of data are generated, such that 90% of the data in the world is thought to have been created in the last couple of years [34, 35]. Consequently, this fast growth of data volumes has led to critical information retrieval problems, among them how to get the relevant document(s) of interest amid such gigantic volumes of textual data and information. To solve such problems, clustering, as a data mining technique, is used for analyzing these massive volumes of data, which are called "Big Data". Without clustering and classification, it is challenging to manage and discover knowledge in a big data environment. However, there have been difficulties in applying clustering algorithms to big data, as clustering algorithms are accompanied by high computational costs and complexity. To make it worse, the emergence of big data (with all its characteristics, including volume, variety, velocity, variability, and complexity) adds more difficulties to this issue, which pushes more studies and research to find every possible way to improve clustering algorithms.
That leads us to the question of how to overcome this dilemma and how to apply clustering algorithms to big data while obtaining results in a reasonable time. One possible solution for improving clustering performance, i.e., getting results of higher accuracy in a reasonable time, is to use a well-designed, time-efficient similarity measure. In fact, the performance of clustering and classification depends heavily on the similarity measure in use, as we have seen in this work with PDSM and STBSM. Despite the fact that both measures are effective, PDSM is time-inefficient, chiefly when used for clustering. Unlike PDSM, STBSM is time-efficient, making it a promising measure for scalable clustering.
Presently, similarity measures are sought mainly to promote the accuracy of classification and clustering, as well as the efficiency of the intended techniques such as the KNN classifier and the K-means clustering algorithm. Therefore, in this work, we proposed a similarity measure that is thought to be capable of handling big data analysis effectively and efficiently. Based on the results drawn for both classification and clustering in particular, we believe that our proposed measure (STBSM) is a promising and effective technique for processing voluminous data in a reasonable time with higher accuracy. When we applied STBSM to all features of each dataset to perform clustering, it drew highly competitive results in a reasonable time compared with all state-of-the-art measures. That means STBSM is both significantly effective and maximally efficient and would add a valued contribution to the field of information retrieval (which is a vital part of big data) in particular and machine learning in general. In fact, while designing STBSM, our focus has been on devising a measure that would help scale up and expedite clustering algorithms without sacrificing result quality. In doing so, the clustering process enjoys flexibility and faster response times at the same time. In other words, with the proposed measure (STBSM) being effective and efficient, clustering for big data (including document clustering) can be efficaciously implemented to enhance the speed of search, precision, recall, search engines, and so on.
Conclusions and future work
Using the BoW model, the KNN classifier, and the K-means algorithm, in the context of text classification and clustering, this paper introduces a new similarity measure that is based on the set theory mechanism and is named the STBSM. Besides the STBSM, a comparative study has been thoroughly carried out on seven similarity measures using six classification criteria and five clustering metrics. The obtained results demonstrated that the STBSM similarity measure achieved almost the best performance on all classification and clustering criteria on both datasets (Reuters and WebKB). Moreover, to stress the proposed measure's superiority, it was imperative to utilize more than one performance criterion to effectively assess all similarity measures; in fact, it was difficult to determine which measure was optimal for any dataset and/or evaluation criterion unless they were all evaluated against each other comprehensively, because each dataset displayed different characteristics when classification or clustering was performed on it. Nonetheless, from the obtained results, it can be concluded that STBSM, PDSM, Cosine, and Jaccard showed superiority over the other measures and obtained the most stable performance trends on both datasets for all K values, compared to the Euclidean, Manhattan, and Kullback–Leibler measures, with Manhattan and Kullback–Leibler noted to have the worst results. On the other extreme, Euclidean and Bhattacharya had a fluctuating performance which can be classified as a middle ground between the high-performance and poor-performance measures.
Additionally, using the K-means clustering algorithm, all similarity measures were involved in a fierce clustering competition. Each similarity measure was individually used to evaluate K-means performance with respect to five evaluation metrics, of which three are external and two are internal. The STBSM, PDSM, and Euclidean were observed to be the top performers in terms of clustering, with STBSM outperforming Euclidean and PDSM in most stages of the evaluation metrics. It is worth mentioning that all these clustering results were collected and analyzed for the case in which the number of clusters K is taken as the number of actual classes in the two datasets (4 and 8). Thus, in follow-up work, to avoid any bias and get a deeper insight into clustering performance, an exhaustive analysis with several K values on different clustering algorithms will be carried out.
All these measures were rigorously examined with regard to their execution time when classification and clustering were run on either dataset. For classification, the results have shown that some measures, such as Bhattacharyya, Manhattan, and Euclidean, achieved the highest speed but at the expense of their overall performance. On the other hand, confirming that the trade-off is inescapable, PDSM was able to achieve better effectiveness results but again at the expense of its efficiency, as this measure in particular was the slowest. Nevertheless, the STBSM, Jaccard, and Cosine measures were a suitable compromise between the fastest measures (Bhattacharyya, Manhattan, and Euclidean) and the slowest measure (PDSM): they were not only faster than PDSM but also closer to the speed of the fastest measures. For clustering, PDSM was also the slowest measure and Manhattan the fastest. As a compromise solution for both effectiveness and efficiency in both classification and clustering, our proposed measure, the STBSM, has shown superiority with regard to clustering as well as classification. Finally, this work briefly described the applicability of the STBSM to big data scenarios. In future work, we plan to broaden the current work to involve more state-of-the-art measures such as those described in [3, 4]. Moreover, the behavior of all these measures will be thoroughly examined on different machine learning tasks such as text summarization [36] and plagiarism detection.
Availability of data and materials
The datasets used in this work are publicly available, and the code has been uploaded to GitHub (https://github.com/aliamer/InformationRetrievalASetTheoryBasedSimilarityMeasureforTextClusteringandClassification).
Abbreviations
IR: Information Retrieval
NF: Number of features
ACC: Accuracy
PRE: Precision
REC: Recall
FM: F-measure
GM: G-measure
AMP: Average mean precision
References
1. Alvarez JE, Bast H. A review of word embedding and document similarity algorithms applied to academic text. Bachelor thesis; 2017.
2. Oghbaie M, Zanjireh MM. Pairwise document similarity measure based on present term set. J Big Data. 2018;5(1):52.
3. Sohangir S, Wang D. Improved sqrt-cosine similarity measurement. J Big Data. 2017;4(1):25.
4. Lin YS, Jiang JY, Lee SJ. A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng. 2013;26(7):1575–90.
5. Xu S. Bayesian Naïve Bayes classifiers to text classification. J Inform Sci. 2018;44(1):48–59.
6. Sheydaei N, Saraee M, Shahgholian A. A novel feature selection method for text classification using association rules and clustering. J Inform Sci. 2015;41(1):3–15.
7. Subhashini R, Kumar VJ. Evaluating the performance of similarity measures used in document clustering and information retrieval. In: 1st Int Conf on Integrated Intelligent Computing, Bangalore; 2010. p. 27–31.
8. Amer AA. On K-means clustering-based approach for DDBSs design. J Big Data. 2020;7(1):1–31.
9. Amer AA, Mohamed MH, Asri K. ASGOP: an aggregated similarity-based greedy-oriented approach for relational DDBSs design. Heliyon. 2020;6(1):e03172.
10. Nguyen L, Amer AA. Advanced cosine measures for collaborative filtering. Adapt Personalization (ADP). 2019;1:21–41.
11. Shahmirzadi O, Lugowski A, Younge K. Text similarity in vector space models: a comparative study. In: 18th IEEE Int Conf on Machine Learning and Applications (ICMLA). IEEE; 2019.
12. Strehl A, Ghosh J, Mooney R. Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search (AAAI 2000); 2000.
13. White RW, Jose JM. A study of topic similarity measures. In: Proc 27th Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval; 2004.
14. Huang A. Similarity measures for text document clustering. In: Proc Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand; 2008.
15. Forsyth RS, Sharoff S. Document dissimilarity within and across languages: a benchmarking study. Literary Linguistic Comput. 2014;29(1):6–22.
16. Thompson VU, Panchev C, Oakes M. Performance evaluation of similarity measures on similar and dissimilar text retrieval. In: 7th Int Joint Conf on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). IEEE; 2015.
17. Fahad A, et al. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comput. 2014;2(3):267–79.
18. Aslam JA, Frost M. An information-theoretic measure for document similarity. In: Proc 26th SIGIR, Toronto; 2003. p. 449–50.
19. Zhao Y. R and data mining: examples and case studies. Cambridge: Academic Press; 2012.
20. Tata S, Patel JM. Estimating the selectivity of tf-idf based cosine similarity predicates. ACM SIGMOD Record. 2007;36(2):7–12.
21. Bhattacharyya A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull Calcutta Math Soc. 1943;35:99–109.
22. Schoenharl TW, Madey G. Evaluation of measurement techniques for the validation of agent-based simulations against streaming data. In: Int Conf on Computational Science. Springer; 2008.
23. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
24. Kullback S. Information theory and statistics. New York: Wiley; 1959.
25. Jaccard P. The distribution of the flora in the alpine zone. New Phytologist. 1912;11(2):37–50.
26. Turney PD, Pantel P. From frequency to meaning: vector space models of semantics. J Artificial Intell Res. 2010;37:141–88.
27. Al-Ghuribi SM, Alshomrani S. A simple study of webpage text classification algorithms for Arabic and English languages. In: Int Conf on IT Convergence and Security (ICITCS). IEEE; 2013.
28. Patil DB, Dongre YV. A fuzzy approach for text mining. IJ Math Sci Comput. 2015;4:34–43.
29. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Process Manage. 1988;24(5):513–23.
30. Jabalameli M, Arman A, Nematbakhsh M. Improving the efficiency of term weighting in set of dynamic documents. Int J Modern Education and Computer Science. 2015;7:42–7.
31. Aggarwal CC, Zhai C. A survey of text classification algorithms. In: Mining text data. Boston: Springer; 2012. p. 163–222.
32. Lakshmi R, Baskar S. DICDOC-K-means: dissimilarity-based initial centroid selection for document clustering using K-means for improving the effectiveness of text document clustering. J Inform Sci. 2019;45(6):818–32.
33. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Networks. 2005;16(3):645–78.
34. Almohsen KA, Al-Jobori H. Recommender systems in light of big data. Int J Electrical and Computer Engineering (IJECE). 2015;5(6):1553–63.
35. Hoad TC, Zobel J. Methods for identifying versioned and plagiarized documents. JASIST. 2003;54:203–15.
36. Nagwani NK. Summarizing large text collection using topic modelling and clustering based on MapReduce framework. J Big Data. 2015;2:6.
Acknowledgements
The authors sincerely thank Prof. Dr. Loc Nguyen (Loc Nguyen's Academic Network) and Prof. Dr. Abdelmoneim Artoli (King Saud University) for their constructive suggestions during this work. The authors also heartily extend their thanks to the Journal of Big Data team (the editors in particular) and to the respected anonymous reviewers for their valuable comments and directions, without which this work would not have been improved.
Funding
This research has been supported by a Research Incentive Fund (RIF) grant, Activity Code R19093, Zayed University, UAE.
Author information
Affiliations
Contributions
Both authors have been the key contributors in conception and design, implementing the approach and analyzing results of all experiments, and the preparation, writing and revising the manuscript. Both authors read and approved the final manuscript.
Authors’ information
Ali A. Amer is an assistant professor in the Computer Science Department at Taiz University, Yemen. He has published many research papers in highly-ranked, top-tier journals as well as refereed international conferences, and has acted as a reviewer for many top-venue platforms. Among the venues he has published in and reviewed for are ACM Computing Surveys, IEEE Access, Computers in Human Behavior, Journal of Big Data (Springer), International Journal on Semantic Web and Information Systems, Journal of Universal Computer Science, Journal of Supercomputing, Heliyon, and Evolutionary Intelligence, to name a few. His primary research interests include Information Systems, Databases, Distributed and Parallel Database Systems, Data Integration, Data Mining, Networks, and Information Retrieval.
Hassan Abdalla has been an Associate Professor of Information Systems at the College of Technological Innovation since 2018. He holds a PhD in Information Systems from London, UK. Prior to joining Zayed University, Dr. Abdalla, who is an Oracle Certified Professional, worked as an associate professor at the College of Computer and Information Sciences at King Saud University (KSU), Riyadh, Saudi Arabia, where he also served as head of the Quality Unit. Dr. Abdalla has published his research in many reputable refereed international journals and conferences, and he has also served as a reviewer for many top-ranked journals. Dr. Abdalla's research interests include Distributed Database Systems, Information Retrieval, Knowledge Management, and Enterprise Computing.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix 1
The code is now publicly available on GitHub.
Appendix 2
To help readers and researchers grasp the idea of this work, we provide a sample of the accuracy tables for the STBSM and PDSM similarity measures; comparable tables can be produced for all similarity measures by running the code of this work.
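As a rough illustration of how such per-measure accuracy figures arise, the toy sketch below runs nearest-neighbour classification over bag-of-words term sets and reports accuracy. It uses the classical Jaccard coefficient [25] purely as a stand-in set-theoretic measure; the STBSM formula itself, the PDSM measure, and the real datasets are defined in the main text and repository, and the corpus here is invented for illustration only.

```python
# Hypothetical sketch: how a per-measure accuracy figure can be computed.
# Jaccard stands in for STBSM; the documents and labels below are toy data.

def jaccard(a, b):
    """Set-theoretic similarity: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_predict(train, labels, test_doc, k=1):
    """Classify test_doc by majority vote among its k most similar training docs."""
    ranked = sorted(range(len(train)),
                    key=lambda i: jaccard(train[i], test_doc), reverse=True)
    top = [labels[i] for i in ranked[:k]]
    return max(set(top), key=top.count)

# Toy corpus: bag-of-words term sets with class labels.
train = [{"stock", "market", "trade"}, {"goal", "match", "team"}]
labels = ["finance", "sport"]
test = [({"market", "trade", "price"}, "finance"),
        ({"team", "goal", "win"}, "sport")]

correct = sum(knn_predict(train, labels, doc, k=1) == y for doc, y in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

Swapping `jaccard` for another similarity function while keeping `knn_predict` fixed is how one measure's accuracy table can be compared against another's under identical conditions.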
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Amer, A.A., Abdalla, H.I. A set theory based similarity measure for text clustering and classification. J Big Data 7, 74 (2020). https://doi.org/10.1186/s40537-020-00344-3
Received:
Accepted:
Published:
Keywords
 Information retrieval
 Text retrieval
 Text classification
 Similarity measures
 Empirical study