From: Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop
Distance metric | Equation | Explanation |
---|---|---|
Euclidean distance | \(D\left( {i,j} \right) = \sqrt {\sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|^{2} } }\) | Widely used distance metric. Suitable for small datasets |
Manhattan distance | \(D\left( {i,j} \right) = \sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|}\) | Distance based on absolute differences. Measures each partition based on the median centre. Suitable for compact clusters |
Minkowski distance | \(D\left( {i,j} \right) = \sqrt[p]{{\sum\nolimits_{i = 1}^{n} {\left| {X_{i} - Y_{i}} \right|^{p} } }}\) | Features with large values and variances tend to dominate other features. Suitable for numeric datasets |
Jaccard distance | \(J(U,A) = \frac{{\left| {U \cap A} \right|}}{{\left| {U \cup A} \right|}}\) | The coefficient \(J(U,A)\) is a measure of similarity; the corresponding distance is \(1 - J(U,A)\). Generally applied to binary values to measure distances between objects |
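
The metrics in the table can be sketched in a few lines of plain Python. This is a minimal illustration and is not taken from the KM-I2C implementation. The function names are hypothetical, points are assumed to be equal-length sequences of numbers, and the Jaccard inputs are assumed to be collections of hashable items. Treating two empty sets as identical is also an assumption of this sketch.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """p-th root of the summed p-th powers of absolute differences.

    p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_similarity(u, a):
    """|U intersect A| / |U union A| for two collections treated as sets."""
    u, a = set(u), set(a)
    union = u | a
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 1.0
    return len(u & a) / len(union)


def jaccard_distance(u, a):
    """Distance form of the Jaccard coefficient: 1 - similarity."""
    return 1.0 - jaccard_similarity(u, a)


if __name__ == "__main__":
    x, y = [0, 0], [3, 4]
    print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3))
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

Because the Minkowski form reduces to the Manhattan and Euclidean distances at p = 1 and p = 2, a single `minkowski` function with a configurable `p` would be enough in a clustering loop. The separate functions are kept here only to mirror the rows of the table.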