Table 1 Summary of distance metrics

From: Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

Distance metric

Equation

Explanation

Euclidean distance

\(D\left( {X,Y} \right) = \sqrt {\sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|^{2} } }\)

The most widely used distance metric: the straight-line distance between two points. Suitable for small datasets

Manhattan distance

\(D\left( {X,Y} \right) = \sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|}\)

Distance based on the absolute differences between coordinates. Measures each partition relative to its median centre. Suitable for compact clusters

Minkowski distance

\(D\left( {X,Y} \right) = \sqrt[p]{{\sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|^{p} } }}\)

Generalises the Manhattan (p = 1) and Euclidean (p = 2) distances. Features with large values and variances tend to dominate the other features. Suitable for numeric datasets

Jaccard distance

\(J(U,A) = \frac{{\left| {U \cap A} \right|}}{{\left| {U \cup A} \right|}}\)

J(U, A) itself measures similarity; the corresponding distance is 1 − J(U, A). Generally applied to binary values to measure distances between objects
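
The four metrics in Table 1 can be illustrated with a minimal Python sketch. It is not taken from the KM-I2C paper: the function names are hypothetical, the inputs are assumed to be equal-length numeric sequences (or sets, for Jaccard), and the Jaccard distance is taken as one minus the similarity ratio shown in the table, which is the usual convention.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """Order-p generalisation: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_similarity(u, a):
    """|U intersect A| / |U union A|, the ratio given in the table."""
    u, a = set(u), set(a)
    union = u | a
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(u & a) / len(union)


def jaccard_distance(u, a):
    """Distance form of the Jaccard ratio: 1 - similarity."""
    return 1.0 - jaccard_similarity(u, a)


if __name__ == "__main__":
    x, y = (0, 0), (3, 4)
    print(euclidean(x, y))                            # 5.0
    print(manhattan(x, y))                            # 7
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))     # 0.5
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7. The sets {1, 2, 3} and {2, 3, 4} share two of four distinct elements, so both the Jaccard similarity and the Jaccard distance are 0.5.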