Pairwise document similarity measure based on present term set

Oghbaie, Marzieh; Mohammadi Zanjireh, Morteza

doi:10.1186/s40537-018-0163-2

Methodology
Open access
Published: 26 December 2018


Journal of Big Data volume 5, Article number: 52 (2018)


Abstract

Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicates detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.

Introduction

In text mining, a similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1, 2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover patterns and correlations and extract information from large-scale collections [1]. These two techniques also offer benefits to different IR applications. For example, document clustering can be applied to the document collection to improve search speed, precision, and recall, or to the search results to provide more effective information presentation to the user [3]. Document classification is also used in vertical search engines [4] and sentiment detection [5].

In large-scale collections, one of the challenging issues is to identify documents with high similarity values, known as near-duplicate documents (or near-duplicates) [6,7,8]. Integration of heterogeneous collections, storing multiple copies of the same document, and plagiarism are the main causes for the existence of near-duplicates. These documents increase processing overheads and storage. Detecting and filtering near-duplicates can address these issues and also improve the search quality [6]. Using a similarity measure is a quantitative way to define two documents as near-duplicates [7].

For a collection of text documents, an appropriate document representation model is required, such as the Vector Space Model (VSM) [9]. According to VSM, each document is represented as an M-dimensional vector in which each dimension corresponds to a vocabulary term or a feature. The exact ordering of terms in the document is discarded (the Bag-of-Words Model [3]). The vocabulary includes all terms that appear in the document collection. A term or feature can be a single word, multiple words, a phrase^{Footnote 1} or other indexing units [9, 10]. The weight of a term represents its importance in the corresponding document and is assigned by a term weighting scheme [11]. Term frequency (tf) [3], inverse document frequency (idf) [12], and the product of tf and idf (tf-idf) [13,14,15] are commonly used term weighting schemes. In large-scale text document collections, using VSM results in sparse vectors, i.e., most of the term weights in a document vector are zero [16, 17]. High dimensionality can be a problem for computing the similarity between two documents.

Using VSM, there are numerous measures to calculate pairwise document similarity. For instance, Euclidean distance is a geometric measure used to measure the distance between two vectors [18, 19]. Cosine similarity compares two documents with respect to the angle between their vectors [11]. Similar to two previous measures, Manhattan distance is also a geometric measure [20, 21]. For two 0–1 vectors,^{Footnote 2} the Hamming distance [17] is the number of positions at which the stored term weights are different. The Chebyshev distance [16] between two vectors is the greatest of absolute differences along any dimension. A similarity measure for text processing (SMTP) [17] is used for comparing two text documents. SMTP can also be extended to measure the similarity between two document collections. Heidarian and Dinneen [18] proposed a novel geometric measure to determine the similarity level between two documents. An Information-Theoretic measure for document Similarity (IT-Sim) is a similarity measure based on information theory [22]. Based on the Suffix Tree Document (STD) model, Chim and Deng [23] proposed a phrase-based measure to compute the similarity between two documents. Sohangir and Wang [16] proposed a new document similarity measure, named Improved Sqrt-Cosine (ISC) similarity. Jaccard coefficient [24] calculates the ratio of the number of terms used in both documents to the number of terms used in at least one of them.

In the context of document classification and clustering, numerous studies have examined the effectiveness of different similarity measures. For instance, Subhashini et al. [25] evaluated the clustering performance of different measures on three web document collections. The results of their experiment showed that Cosine similarity performs better than Euclidean distance and Jaccard coefficient. Hammouda and Kamel [9] proposed a system for web document clustering. In their system, a phrase-based similarity measure was used to calculate the similarity between two documents. D’hondt et al. [11] proposed a novel dissimilarity measure for document clustering, called pairwise-adaptive. This Cosine-based measure selects the K most important terms of each document based on their weights and reduces the dimensionality to the document’s most important terms. Thus, pairwise-adaptive has a lower computational burden and is applicable to high-dimensional document collections. Lin et al. [17] used SMTP for text clustering and classification.

Moreover, the effect of various similarity measures on the performance of near-duplicates detection has been investigated in recent studies. For instance, Rezaeian and Novikova [8] used Hamming distance to discover plagiarism in Russian texts. Xiao et al. [7] proposed a new algorithm for detecting near-duplicate documents. They adapted their proposed algorithm to commonly used similarity measures, such as Jaccard coefficient, Cosine similarity, and Hamming distance. Hajishirzi et al. [26] proposed a new vector representation for documents. Using their method, Jaccard coefficient and Cosine similarity provide higher near-duplicates detection accuracy.

As explained in more detail in the “Proposed similarity measure” section, most similarity measures judge the closeness of two documents to each other based on the term weights. Although term weights provide important information about the similarity between two documents, similarity judgment based on the term weights alone is sometimes not sufficient. The motivation behind this work is our belief that pairwise document similarity should be based not only on the term weights but also on the number of terms that appear in at least one of the two documents. In this paper, we propose a new measure to compute the similarity between two text documents. In this symmetric measure, the similarity increases as the number of terms used in both documents (present terms) and the information content associated with these terms increase. Furthermore, the terms used in only one of the two documents (presence-absence terms) contribute to the similarity measurement.

We conduct a comprehensive experiment to evaluate the effectiveness of our proposed measure on the performance of several text mining applications, including near-duplicates detection, single-label classification, and K-means-like clustering. The obtained results show that our proposed similarity measure improves the efficiency of these algorithms. The rest of this paper is organized as follows: the “Related work” section presents the background of similarity measures. The “Proposed similarity measure” section introduces the proposed similarity measure. The “Methods” and “Results and discussion” sections discuss experiment details and experimental results, respectively. Finally, the conclusion and the discussion about future work are given in the last section.

Related work

In this paper, the VSM is selected as the document representation model; $ \vec{d} $ is the vector of document d. Furthermore, M and N indicate the size of vocabulary and the number of documents in the document collection, respectively. Some of the measures have been briefly reported in the previous section, but this section covers some popular measures for computing the similarity between two text documents.

Cosine similarity [16] computes the Cosine of the angle between $ \overrightarrow {{d_{1} }} $ and $ \overrightarrow {{d_{2} }} $:

$$ Cosine{\text{-}}similarity \,\left( {d_{1} ,d_{2} } \right) = \frac{{\overrightarrow {{d_{1} }} \cdot \overrightarrow {{d_{2} }} }}{{\left| {\overrightarrow {{d_{1} }} } \right|\cdot \left| {\overrightarrow {{d_{2} }} } \right|}} $$

(1)

where $ \left| {\vec{d}} \right| $ is the Euclidean norm of the vector $ \vec{d} $ and $ \overrightarrow {{d_{1} }} \cdot \overrightarrow {{d_{2} }} $ denotes the inner product of $ \overrightarrow {{d_{1} }} $ and $ \overrightarrow {{d_{2} }} $. With non-negative term weights, Cosine similarity ranges from 0 to 1.

Euclidean distance [27] between two documents is defined as follows ($ w_{ji} $ is the ith term weight in document j):

$$ Euclidean{\text{-}}distance\,\left( {d_{1} , d_{2} } \right) = \sqrt {\mathop \sum\limits_{i = 1}^{M} \left( {w_{1i} - w_{2i} } \right)^{2} } $$

(2)

Manhattan distance or “sum-norm” [28] computes the sum of absolute differences between the respective coordinates of two document vectors:

$$ Manhattan{\text{-}}distance\,\left( {d_{1} ,d_{2} } \right) = \mathop \sum \limits_{i = 1}^{M} \left| {w_{1i} - w_{2i} } \right| $$

(3)

Chebyshev distance or “max-norm” [19] is the greatest of absolute differences along any coordinate of two document vectors:

$$ Chebyshev{\text{-}}distance\,\left( {d_{1} ,d_{2} } \right) = \mathop {\max }\limits_{1 \le i \le M} \left| {w_{1i} - w_{2i} } \right| $$

(4)
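As a rough illustration of Eqs. (1)–(4), the following sketch implements the four geometric measures on plain Python lists. The function names and the two sample vectors are assumptions made for this illustration and do not come from the paper.

```python
import math


def cosine_similarity(d1, d2):
    """Eq. (1): inner product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    # Undefined for an all-zero vector; this sketch assumes that case does not occur.
    return dot / (norm1 * norm2)


def euclidean_distance(d1, d2):
    """Eq. (2): square root of the summed squared weight differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


def manhattan_distance(d1, d2):
    """Eq. (3): sum of absolute weight differences."""
    return sum(abs(a - b) for a, b in zip(d1, d2))


def chebyshev_distance(d1, d2):
    """Eq. (4): largest absolute weight difference over all terms."""
    return max(abs(a - b) for a, b in zip(d1, d2))


# Hypothetical term-weight vectors over a three-term vocabulary.
u = [1, 0, 2]
v = [0, 1, 2]
print(cosine_similarity(u, v))
print(euclidean_distance(u, v))
print(manhattan_distance(u, v))
print(chebyshev_distance(u, v))
```

Cosine similarity is a similarity (larger means closer), whereas the other three are distances (smaller means closer), which matters when the measures are compared in the examples of the next section.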

Jaccard coefficient [29] measures the overlap between two documents each represented by a set:

$$ J\left( {d_{1} , d_{2} } \right) = \frac{{\left| {d_{1} \cap d_{2} } \right|}}{{\left| {d_{1} \cup d_{2} } \right|}} .$$

(5)

Using real-valued features, Extended Jaccard coefficient (EJ) [24] is a length-dependent similarity measure:

$$ EJ\left( {d_{1} , d_{2} } \right) = \frac{{\overrightarrow {{d_{1} }} \cdot \overrightarrow {{d_{2} }} }}{{\overrightarrow {{d_{1} }} \cdot \overrightarrow {{d_{1} }} + \overrightarrow {{d_{2} }} \cdot \overrightarrow {{d_{2} }} - \overrightarrow {{d_{1} }} \cdot \overrightarrow {{d_{2} }} }} $$

(6)
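The two Jaccard variants in Eqs. (5) and (6) can be sketched as follows. The function names, the sample term sets, and the convention of returning 0 for two empty sets are assumptions of this illustration.

```python
def jaccard(terms1, terms2):
    """Eq. (5): size of the intersection over size of the union of two term sets."""
    s1, s2 = set(terms1), set(terms2)
    union = s1 | s2
    # Returning 0.0 for two empty sets is a convention assumed here.
    return len(s1 & s2) / len(union) if union else 0.0


def extended_jaccard(d1, d2):
    """Eq. (6): d1.d2 / (d1.d1 + d2.d2 - d1.d2) on real-valued weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (sum(a * a for a in d1) + sum(b * b for b in d2) - dot)


# Hypothetical documents over the vocabulary <a, b, c, d>.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
# The same documents encoded as 0-1 vectors.
print(extended_jaccard([1, 1, 1, 0], [0, 1, 1, 1]))
```

On 0–1 vectors the inner products reduce to set sizes, so EJ gives the same value as the set-based Jaccard coefficient.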

Lin [28] proposed an information-theoretic definition of similarity. According to his definition, the similarity between two objects is the ratio of information which two objects share in common to the information describing both objects. IT-Sim [22] was proposed with respect to Lin’s definition of similarity:

$$ IT\_Sim\left( {d_{1} , d_{2} } \right) = \frac{{2 \cdot \mathop \sum \nolimits_{t} min\left\{ {p_{{d_{1} , t}} , p_{{d_{2} ,t}} } \right\} \cdot log\pi \left( t \right)}}{{\mathop \sum \nolimits_{t} p_{{d_{1} , t}} log\pi \left( t \right) + \mathop \sum \nolimits_{t} p_{{d_{2} ,t}} log\pi \left( t \right)}} $$

(7)

where $ p_{d, t} $ is the normalized occurrence of term t in document d and π(t) is the fraction of the collection documents containing term t.
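A minimal sketch of Eq. (7) is given below, assuming that the normalized occurrences $ p_{d, t} $ and the document fractions π(t) have already been computed and are passed in as dictionaries. The function name, argument layout, and sample values are hypothetical.

```python
import math


def it_sim(p1, p2, pi):
    """Eq. (7). p1 and p2 map term -> normalized occurrence in each document,
    pi maps term -> fraction of collection documents containing the term."""
    terms = set(p1) | set(p2)
    shared = sum(min(p1.get(t, 0.0), p2.get(t, 0.0)) * math.log(pi[t]) for t in terms)
    total = sum(p1.get(t, 0.0) * math.log(pi[t]) for t in terms) + sum(
        p2.get(t, 0.0) * math.log(pi[t]) for t in terms
    )
    # total is zero when every term occurs in all documents (log 1 = 0);
    # this sketch assumes that degenerate case does not occur.
    return 2 * shared / total


# Hypothetical values for a three-term vocabulary.
pi = {"a": 0.1, "b": 0.2, "c": 0.1}
p1 = {"a": 0.5, "b": 0.5}
p2 = {"b": 0.5, "c": 0.5}
print(it_sim(p1, p1, pi))  # identical documents
print(it_sim(p1, p2, pi))  # documents sharing only term "b"
```

Because the logarithm appears in both the numerator and the denominator, the choice of base does not affect the result.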

SMTP [17] calculates the similarity between two documents and considers a constant penalty, λ, for any presence-absence term:

$$ SMTP\left( {d_{1} ,d_{2} } \right) = \frac{{F\left( {d_{1} , d_{2} } \right) + \lambda }}{\lambda + 1} $$

(8)

$$ F\left( {d_{1} , d_{2} } \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{M} N_{*} \left( {d_{1i} ,d_{2i} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{M} N_{\mathop \cup \nolimits } \left( {d_{1i} ,d_{2i} } \right)}} $$

(9)

$$ N_{*} \left( {d_{1i} ,d_{2i} } \right) = \left\{ {\begin{array}{ll} 0.5\, \left(1 + exp\left\{ { - \left( {\frac{{d_{1i} - d_{2i} }}{{\sigma_{i} }}} \right)^{2} } \right\}\right), &\quad if \;\; d_{1i} d_{2i} > 0 \\ 0, &\quad if \;\; d_{1i} = 0 \ and \ d_{2i} = 0\\ - \lambda , &\quad otherwise \\ \end{array} } \right. $$

(10)

$$ N_{ \cup } \left( {d_{1i} ,d_{2i} } \right) = \left\{ {\begin{array}{ll} 0, &\quad if \;\; d_{1i} = 0 \ and \ d_{2i} = 0 \\ 1, &\quad otherwise \end{array} } \right. $$

(11)

where λ is calculated according to the properties of the document collection and $ \sigma_{i} $ is the standard deviation of all non-zero weights for term i in the training set.
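Eqs. (8)–(11) can be read as a single pass over the vocabulary, as in the sketch below. It assumes non-negative weights and takes λ and the per-term standard deviations as given inputs; how they are estimated from a collection is outside this illustration, and the function and argument names are hypothetical.

```python
import math


def smtp(d1, d2, sigma, lam):
    """SMTP as in Eqs. (8)-(11). sigma[i] is the standard deviation of term i,
    lam is the constant penalty for presence-absence terms."""
    numerator = 0.0   # running sum of N_* over all terms, Eq. (10)
    denominator = 0   # running sum of N_U over all terms, Eq. (11)
    for x, y, s in zip(d1, d2, sigma):
        if x == 0 and y == 0:
            continue  # absent term: contributes 0 to both sums
        denominator += 1
        if x * y > 0:
            # present term: closer weights give a value nearer to 1
            numerator += 0.5 * (1 + math.exp(-(((x - y) / s) ** 2)))
        else:
            # presence-absence term: constant penalty
            numerator -= lam
    # denominator is zero when both documents are empty; assumed not to occur here.
    f = numerator / denominator  # Eq. (9)
    return (f + lam) / (1 + lam)  # Eq. (8)


print(smtp([1, 2, 0], [1, 2, 0], sigma=[1.0, 1.0, 1.0], lam=1.0))  # identical documents
print(smtp([1, 0], [0, 1], sigma=[1.0, 1.0], lam=1.0))             # no present term
```

With this normalization, F ranges from -λ to 1, so the SMTP value ranges from 0 to 1. Note that the present-term branch divides by the standard deviation and therefore fails when it is zero.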

Proposed similarity measure

Problem definition

When comparing two documents, terms can be partitioned into three groups [17]: (1) the terms that occur in both documents (present terms), (2) the terms that occur in only one of the two documents (presence–absence terms), and (3) the terms that occur in neither of the two documents (absent terms). According to this partitioning, there are six preferable properties for a document similarity measure [17]:

1. The presence or absence of a term is more important than the difference between two non-zero weights of a present term.
2. The similarity degree between two documents should decrease when the difference between two non-zero weights of a present term increases.
3. The similarity degree should decrease when the number of presence-absence terms between two documents increases.
4. Two documents are least similar to each other, if there is no present term between them.
5. The similarity measure should be symmetric:
$$ {\text{Similarity }}\left( {{\text{d}}_{ 1} ,{\text{ d}}_{ 2} } \right) \, = {\text{Similarity }}\left( {{\text{d}}_{ 2} ,{\text{ d}}_{ 1} } \right) $$
6. The distribution of term weights in the document collection should contribute to the similarity measurement.

According to these properties, the measures described in the “Related work” section have one or more deficiencies. For example, Cosine similarity does not satisfy Properties 3 and 6, and Euclidean distance does not meet Properties 1, 3, 4, and 6 [17].

Although SMTP was proposed based on these properties, some drawbacks influence its performance and accuracy. First, SMTP cannot calculate similarity when the standard deviation of a particular term is (or approaches) zero. Nagwani [30] proposed an improvement to address this shortcoming by adding a supplementary condition to (10). The new version of $ N_{*} \left( {d_{1i} ,d_{2i} } \right) $ is given in (12); the equations for $ SMTP\left( {d_{1} ,d_{2} } \right) $, $ F\left( {d_{1} , d_{2} } \right) $, and $ N_{\mathop \cup \nolimits } \left( {d_{1i} ,d_{2i} } \right) $ remain unchanged.

$$ N_{*} \left( {d_{1i} ,d_{2i} } \right) = \left\{ {\begin{array}{ll} 1, &\quad if \;\; d_{1i} = d_{2i} \,\,and \,\,d_{1i} ,d_{2i} > 0 \\ 0.5\, \left(1 + exp\left\{ { - \left( {\frac{{d_{1i} - d_{2i} }}{{\sigma_{i} }}} \right)^{2} } \right\} \right), &\quad if \;\; d_{1i} d_{2i} > 0 \\ 0, &\quad if \;\; d_{1i} = 0\,\, and\,\, d_{2i} = 0 \\ - \lambda , &\quad otherwise \end{array} } \right. . $$

(12)
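The case analysis in (12) can be sketched as a small function in which the cases are tested in the order they are listed. The function name and arguments are hypothetical, and non-negative weights are assumed.

```python
import math


def n_star(x, y, sigma_i, lam):
    """N_* as in Eq. (12); x and y are the weights of term i in the two documents."""
    if x == y and x > 0:
        # Supplementary condition: equal non-zero weights score 1 without
        # dividing by sigma_i, so a zero standard deviation is harmless here.
        return 1.0
    if x * y > 0:
        return 0.5 * (1 + math.exp(-(((x - y) / sigma_i) ** 2)))
    if x == 0 and y == 0:
        return 0.0
    return -lam


print(n_star(2, 2, sigma_i=0.0, lam=1.0))  # equal weights, zero standard deviation
print(n_star(1, 0, sigma_i=1.0, lam=0.5))  # presence-absence term
```

If the standard deviation of a term is computed from the same documents that are being compared, a value of zero means all of its non-zero weights are equal, so the first case applies and the division in the second case is never reached.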

Second, λ is determined according to the properties of the document collection, so all presence-absence terms have identical importance. That is, the weights of presence-absence terms do not contribute to the similarity measurement, and the weight distribution is taken into consideration only for the present terms.

Careful examination of similar documents in different instances showed that two documents with the highest similarity degree use almost identical term sets. In other words, the more terms two documents have in common, the more similar they are. In such cases, some of the well-known measures cannot recognize the most similar documents. The following example clarifies the case. Consider three documents d₁, d₂, and d₃ with four terms and tf as the term weighting scheme. Further, assume that the terms are not shared by all the documents of the collection. Our goal is to find the document most similar to d₁.

Example 1

$$ \begin{aligned} d_{1} = \langle 2, 3, 3, 7\rangle \hfill \\ d_{2} = \langle 2, 1, 1, 1 \rangle\hfill \\ d_{3} = \langle 0, 0, 0, 5 \rangle\hfill \\ \end{aligned} $$

Using Manhattan distance, the distance between d₁ and d₂ and the distance between d₁ and d₃ are both 10. This identical distance suggests that d₂ and d₃ share the same amount of information with d₁. However, the number of present terms clearly differs: d₁ and d₂ use the same term set, but d₁ and d₃ share only one term. Using Cosine similarity, the similarity between d₁ and d₂ is 0.763, and the similarity between d₁ and d₃ is 0.831. Although d₁ and d₂ use the same term set, Cosine similarity selects d₃ as the most similar document to d₁.
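For reference, the values quoted for Example 1 follow directly from Eqs. (1) and (3); the arithmetic is written out here as a check:

$$ \begin{aligned} Manhattan\left( {d_{1} ,d_{2} } \right) &= \left| {2 - 2} \right| + \left| {3 - 1} \right| + \left| {3 - 1} \right| + \left| {7 - 1} \right| = 10 \\ Manhattan\left( {d_{1} ,d_{3} } \right) &= \left| {2 - 0} \right| + \left| {3 - 0} \right| + \left| {3 - 0} \right| + \left| {7 - 5} \right| = 10 \\ Cosine\left( {d_{1} ,d_{2} } \right) &= \frac{4 + 3 + 3 + 7}{\sqrt {71} \cdot \sqrt 7 } = \frac{17}{\sqrt {497} } \approx 0.763 \\ Cosine\left( {d_{1} ,d_{3} } \right) &= \frac{35}{\sqrt {71} \cdot 5} \approx 0.831 \end{aligned} $$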

Example 2

In another example, suppose we have three documents as follows:

d₁ = “this garden is different because it only has different types of vegetables.”
d₂ = “different types of vegetables are growing in this garden. This garden only has vegetables.”
d₃ = “different types of vegetables growing in your garden can have different benefits for your health.”

The vocabulary contains 7 terms: 〈 types, different, garden, vegetables, growing, benefits, health 〉, and document vectors are as follows:

$$ \begin{aligned} d_{1} = \langle 1,2,1, 1, 0, 0, 0 \rangle \hfill \\ d_{2} =\langle 1, 1, 2, 2, 1, 0, 0 \rangle \hfill \\ d_{3} = \langle 1,2,1,1,1,1,1 \rangle \hfill \\ \end{aligned}. $$

It is obvious that documents d₁ and d₂ both talk about one type of garden, which contains only vegetables, whereas the third document talks about types of vegetables that are beneficial for your health. Although the second document is the most similar one to the first document, EJ and Euclidean distance select d₃ as the most similar document to d₁:

$$ EJ\left( {d_{1} ,d_{2} } \right) = 0.636 < EJ\left( {d_{1} ,d_{3} } \right) = 0.7 $$

$$ Euclidean\left( {d_{1} ,d_{2} } \right) = 2 > Euclidean\left( {d_{1} ,d_{3} } \right) = \sqrt 3. $$
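The vectors and values of Example 2 can be reproduced with a short script. The sketch below assumes a very simple tokenizer (lowercasing and splitting on non-letters) and counts only the seven vocabulary terms listed in the example; the helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Vocabulary in the order given in Example 2.
VOCABULARY = ["types", "different", "garden", "vegetables", "growing", "benefits", "health"]


def tf_vector(text):
    """Term-frequency vector over VOCABULARY; all other words are ignored."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in VOCABULARY]


def extended_jaccard(d1, d2):
    """Eq. (6)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (sum(a * a for a in d1) + sum(b * b for b in d2) - dot)


def euclidean_distance(d1, d2):
    """Eq. (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


d1 = tf_vector("this garden is different because it only has different types of vegetables.")
d2 = tf_vector("different types of vegetables are growing in this garden. "
               "This garden only has vegetables.")
d3 = tf_vector("different types of vegetables growing in your garden can have "
               "different benefits for your health.")

print(d1, d2, d3)
print(extended_jaccard(d1, d2), extended_jaccard(d1, d3))
print(euclidean_distance(d1, d2), euclidean_distance(d1, d3))
```

EJ is a similarity, so its larger value for d₃ ranks d₃ first; Euclidean is a distance, so its smaller value for d₃ also ranks d₃ first.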

As the previous examples show, the measures described in the “Related work” section, such as EJ, Euclidean, Manhattan, and Cosine, are insufficient to determine the most similar documents, because they judge the similarity between two documents based on the term weights alone. Based on these observations, we conclude that an appropriate similarity measure should take the number of present terms into account to achieve a more accurate similarity value.

Although the number of present terms benefits the similarity calculation between two documents, it alone is not sufficient. The third example clarifies this case (under the same assumptions as in Example 1).

Example 3

$$ \begin{aligned} d_{1} = \langle 1, 7, 7, 4 \rangle \hfill \\ d_{2} = \langle 1, 2, 2, 2 \rangle \hfill \\ d_{3} = \langle 1, 6, 6, 5 \rangle \hfill \\ \end{aligned} .$$

In this example, if we want to find the most similar document to d₁, the number of present terms alone cannot help, because d₂ and d₃ both have 4 terms in common with d₁. Using Manhattan distance, the distance between d₁ and d₂ is 12 and the distance between d₁ and d₃ is 3. If we use Cosine similarity, the similarity between d₁ and d₂ is 0.9569 and the similarity between d₁ and d₃ is 0.989. Accordingly, in this example, both Manhattan distance and Cosine similarity correctly recognize d₃ as the most similar document to d₁. Therefore, two documents may use the same term set but have different contents, because the associated terms have different weights. In such cases, most of the similarity measures can accurately judge the similarity between documents.

The third example shows that the number of present terms alone cannot capture all the similarity information between documents and the term weights are still required. But in cases like Examples 1 and 2, the number of present terms can help to find documents with the highest similarity degree.
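To make the contrast between Examples 1 and 3 concrete, the sketch below counts present, presence-absence, and absent terms for each document pair. The function name is hypothetical, and a term is treated as occurring in a document when its weight is non-zero.

```python
def term_partition(d1, d2):
    """Return the counts (present, presence_absence, absent) for two weight vectors."""
    present = sum(1 for a, b in zip(d1, d2) if a != 0 and b != 0)
    presence_absence = sum(1 for a, b in zip(d1, d2) if (a != 0) != (b != 0))
    absent = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)
    return present, presence_absence, absent


# Vectors from Example 1.
e1_d1, e1_d2, e1_d3 = [2, 3, 3, 7], [2, 1, 1, 1], [0, 0, 0, 5]
# Vectors from Example 3.
e3_d1, e3_d2, e3_d3 = [1, 7, 7, 4], [1, 2, 2, 2], [1, 6, 6, 5]

print(term_partition(e1_d1, e1_d2), term_partition(e1_d1, e1_d3))
print(term_partition(e3_d1, e3_d2), term_partition(e3_d1, e3_d3))
```

In Example 1 the present-term counts differ (4 versus 1), so they separate d₂ from d₃; in Example 3 both pairs have 4 present terms, so only the weights can separate them.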

Pairwise document similarity measure

In this section, we introduce the proposed similarity measure. As explained earlier, the number of present terms and their weights play an important role in accurately judging the similarity between documents and help find the most similar ones. The idea is that the shared information content created by a larger number of present terms is more informative than that created by fewer present terms. In other words, if two documents use similar term sets, they tend to have more similar content and theme.

Based on this idea, we add a new property to the preferable properties mentioned earlier for a similarity measure:

7. The similarity degree should increase when the number of present terms increases.

We also suggest ignoring Property 6, which is related to the weight distribution of terms in the document collection. According to Property 6, the importance of a term in the document collection (discrimination power [3]) is essential and a term with higher idf is more informative. However, this property can be handled by the term weighting scheme rather than by the similarity measure. This makes the process of assessment easier and also allows us to check the effect of different term weighting schemes on the performance of similarity measures. For example, if we use tf-idf as the term weighting scheme, the importance of rare terms, which have higher idf, is considered in the similarity measurement. Table 1 shows which preferable properties are not satisfied by the similarity measures mentioned in the “Related work” section.

Table 1 Deficiencies of some popular measures according to the preferable properties for a similarity measure
