Table 1 Summary of distance metrics

From: Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

| Distance metric | Equation | Explanation |
| --- | --- | --- |
| Euclidean distance | \(D(X, Y) = \sqrt{\sum\nolimits_{i = 1}^{n} \lvert X_i - Y_i \rvert^{2}}\) | The most commonly used distance metric. Suitable for small datasets. |
| Manhattan distance | \(D(X, Y) = \sum\nolimits_{i = 1}^{n} \lvert X_i - Y_i \rvert\) | Sum of the absolute differences between coordinates. Measures each partition based on the median centre. Suitable for compact clusters. |
| Minkowski distance | \(D(X, Y) = \sqrt[p]{\sum\nolimits_{i = 1}^{n} \lvert X_i - Y_i \rvert^{p}}\) | Generalizes the Manhattan (\(p = 1\)) and Euclidean (\(p = 2\)) distances. Features with large values and variances tend to dominate other features. Suitable for numeric datasets. |
| Jaccard distance | \(d_J(U, A) = 1 - J(U, A)\), where \(J(U, A) = \frac{\lvert U \cap A \rvert}{\lvert U \cup A \rvert}\) | \(J(U, A)\) is the Jaccard similarity coefficient; the distance is its complement. Generally applied to binary values to measure distances between objects. |

Here \(X\) and \(Y\) are two \(n\)-dimensional data points with components \(X_i\) and \(Y_i\), and \(U\) and \(A\) are two sets.
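
The metrics in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the article: the function names are hypothetical, points are assumed to be equal-length numeric sequences, and the Jaccard distance is assumed to be one minus the intersection-over-union ratio, with two empty sets treated as identical.

```python
import math


def _check_same_length(x, y):
    # Assumption: both points are sequences with the same number of features.
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    _check_same_length(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    _check_same_length(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """p-th root of the sum of absolute differences raised to the power p."""
    _check_same_length(x, y)
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_similarity(u, a):
    """Size of the intersection divided by the size of the union."""
    u, a = set(u), set(a)
    union = u | a
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(u & a) / len(union)


def jaccard_distance(u, a):
    """Complement of the Jaccard similarity."""
    return 1.0 - jaccard_similarity(u, a)


if __name__ == "__main__":
    x, y = (0, 0), (3, 4)
    print(euclidean(x, y))        # 5.0
    print(manhattan(x, y))        # 7
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

With `p = 1` and `p = 2`, `minkowski` gives the same values as `manhattan` and `euclidean` respectively, up to floating-point rounding. Because every formula sums raw coordinate differences, features with large ranges dominate the result, so in practice features are often rescaled before clustering.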