# 1(b).2.1: Measures of Similarity and Dissimilarity | STAT 508 (2023)

### Similarity and Dissimilarity

Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how alike two distributions are, while a dissimilarity measure quantifies how different they are. For multivariate data, more complex summary measures have been developed to answer this question.

- **Similarity measure**: a numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity).
- **Dissimilarity measure**: a numerical measure of how different two data objects are; it typically ranges from 0 (objects are alike) to $$\infty$$ (objects are completely different).
- **Proximity**: refers to either a similarity or a dissimilarity.

### Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

| Attribute Type | Similarity | Dissimilarity |
| --- | --- | --- |
| Nominal | $$s=\begin{cases}1 & \text{if } p=q \\ 0 & \text{if } p\neq q\end{cases}$$ | $$d=\begin{cases}0 & \text{if } p=q \\ 1 & \text{if } p\neq q\end{cases}$$ |
| Ordinal | $$s=1-\dfrac{\lvert p-q \rvert}{n-1}$$ (values mapped to integers 0 to n−1, where n is the number of values) | $$d=\dfrac{\lvert p-q \rvert}{n-1}$$ |
| Interval or Ratio | $$s=1-\lvert p-q \rvert$$ or $$s=\dfrac{1}{1+\lvert p-q \rvert}$$ | $$d=\lvert p-q \rvert$$ |
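
To make the table concrete, the single-attribute measures can be sketched in a few lines of Python (a minimal illustration; the function names are our own):

```python
def nominal_sim(p, q):
    # nominal: similarity is 1 if the values match, 0 otherwise
    return 1 if p == q else 0

def ordinal_sim(p, q, n):
    # ordinal: values assumed mapped to integers 0 .. n-1
    return 1 - abs(p - q) / (n - 1)

print(nominal_sim("red", "red"))   # 1
print(nominal_sim("red", "blue"))  # 0
print(ordinal_sim(0, 2, 3))        # 0.0 (most dissimilar when n = 3)
print(ordinal_sim(1, 2, 3))        # 0.5
```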

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties.

### Common Properties of Dissimilarity Measures

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality),

where d(p, q) is the distance (dissimilarity) between data objects p and q.

A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

### Euclidean Distance

Assume that we have measurements $$x_{ik}$$, $$i = 1 , \ldots , N$$, on variables $$k = 1 , \dots , p$$ (also called attributes).

The Euclidean distance between the ith and jth objects is

$$d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}$$

for every pair (i, j) of observations.

The weighted Euclidean distance is:

$$d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}$$

If the scales of the attributes differ substantially, standardization is necessary before computing distances; otherwise, variables measured on larger scales dominate the result.
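
Both formulas can be sketched in a few lines of Python (the function names are ours; the data are the five-attribute objects used in the worked example below, and uniform weights recover the unweighted distance):

```python
import math

def euclidean(x, y):
    # d_E(i, j) = ( sum_k (x_ik - x_jk)^2 )^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    # d_WE(i, j) = ( sum_k W_k (x_ik - x_jk)^2 )^(1/2)
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))

x1 = [1, 3, 1, 2, 4]
x2 = [1, 2, 1, 2, 1]
print(round(euclidean(x1, x2), 3))                    # 3.162
print(round(weighted_euclidean(x1, x2, [1] * 5), 3))  # 3.162 (uniform weights)
```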

### Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With measurements $$x_{ik}$$, $$i = 1, \dots, N$$, $$k = 1, \dots, p$$, the Minkowski distance is

$$d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}$$

where $$\lambda \geq 1$$. It is also called the $$L_\lambda$$ metric.

• $$\lambda = 1$$: the $$L_1$$ metric, Manhattan or city-block distance.
• $$\lambda = 2$$: the $$L_2$$ metric, Euclidean distance.
• $$\lambda \rightarrow \infty$$: the $$L_\infty$$ metric, supremum distance.

$$\lim_{\lambda \to \infty} d_M(i, j)=\lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right)$$

Note that $$\lambda$$ and p are two different parameters: $$\lambda$$ is the order of the metric, while p is the number of attributes (the dimension of the data matrix), which remains finite.
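
The general formula and its three special cases can be sketched in Python (a minimal illustration with our own function name; `math.inf` selects the supremum case):

```python
import math

def minkowski(x, y, lam):
    # d_M(i, j) = ( sum_k |x_ik - x_jk|^lambda )^(1/lambda)
    # lam = math.inf gives the L_inf (supremum) metric
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if lam == math.inf:
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1 / lam)

x1 = [1, 3, 1, 2, 4]
x2 = [1, 2, 1, 2, 1]
print(minkowski(x1, x2, 1))            # 4.0 (Manhattan)
print(round(minkowski(x1, x2, 2), 3))  # 3.162 (Euclidean)
print(minkowski(x1, x2, math.inf))     # 3 (supremum)
```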

### Mahalanobis Distance

Let X be an N × p matrix. Then the $$i^{th}$$ row of X is

$$x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)$$

The Mahalanobis distance is

$$d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}$$

where $$\Sigma$$ is the p × p sample covariance matrix.
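
For p = 2 the covariance matrix can be estimated and inverted by hand; here is a minimal Python sketch (the helper names are ours, the covariance divides by N − 1 as R's `cov` does, and the data are the three observations from the R example further below):

```python
import math

def sample_cov2(points):
    # unbiased sample covariance of 2-D points (divides by N - 1)
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis2(x, y, S):
    # d_MH = sqrt( (x - y)^T S^{-1} (x - y) ) for p = 2
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    d = [x[0] - y[0], x[1] - y[1]]
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.sqrt(q)

pts = [(2, 3), (10, 7), (3, 2)]
S = sample_cov2(pts)                             # [[19.0, 11.0], [11.0, 7.0]]
print(round(mahalanobis2(pts[0], pts[1], S), 3)) # 2.0
```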

## Try it!

Calculate the answers to these questions yourself and then check them against the solutions below.

Given three objects with measurements on five attributes: $$x_1 = (1, 3, 1, 2, 4)$$, $$x_2 = (1, 2, 1, 2, 1)$$, and $$x_3 = (2, 2, 2, 2, 2)$$.

1. Calculate the Euclidean distances.
2. Calculate the Minkowski distances ($$\lambda = 1$$ and $$\lambda \rightarrow \infty$$ cases).
1. Euclidean distances are:

$$d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162$$

$$d _ { E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646$$

$$d _ { E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732$$

2. Minkowski distances (when $$\lambda = 1$$ ) are:

$$d _ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4$$

$$d _ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5$$

$$d _ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3$$

Minkowski distances (when $$\lambda \rightarrow \infty$$) are:

$$d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3$$

$$d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1$$

Given two objects with measurements $$x_1 = (2, 3)$$ and $$x_2 = (10, 7)$$, plus a third observation $$x_3 = (3, 2)$$ used only in estimating the covariance matrix:

1. Calculate the Minkowski distance ($$\lambda = 1$$, $$\lambda = 2$$, and $$\lambda \rightarrow \infty$$ cases) between the first and second objects.
2. Calculate the Mahalanobis distance between the first and second objects.
1. The Minkowski distances are:

$$\lambda = 1:\quad d_M(1,2) = | 2 - 10 | + | 3 - 7 | = 12$$

$$\lambda = 2:\quad d_M(1,2) = d_E(1,2) = \left( ( 2 - 10 )^2 + ( 3 - 7 )^2 \right)^{1/2} = 8.944$$

$$\lambda \rightarrow \infty:\quad d_M(1,2) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8$$

2. Since

$$\Sigma = \left( \begin{array}{ll} 19 & 11 \\ 11 & 7 \end{array} \right)$$

we have

$$\Sigma^{-1} = \left( \begin{array}{cc} 7/12 & -11/12 \\ -11/12 & 19/12 \end{array} \right)$$

so the Mahalanobis distance is

$$d_{MH}(1,2) = \left( (x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2) \right)^{1/2} = 2$$

R code for the Mahalanobis distance:

```r
# Mahalanobis distance calculation
d1 = c(2, 3)   # each observation
d2 = c(10, 7)
d3 = c(3, 2)

# Get the covariance matrix using all observations
cov_all = cov(rbind(d1, d2, d3))
cov_all

# The inverse covariance matrix:
solve(cov_all)

# The Mahalanobis distance:
Mahalanobis_dist = sqrt(matrix(d1 - d2, 1, 2) %*% solve(cov_all) %*% matrix(d1 - d2, 2, 1))
Mahalanobis_dist
```

### Common Properties of Similarity Measures

Similarities have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,
2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects, p and q.

### Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

|       | q = 1       | q = 0       |
| ----- | ----------- | ----------- |
| p = 1 | $$n_{1,1}$$ | $$n_{1,0}$$ |
| p = 0 | $$n_{0,1}$$ | $$n_{0,0}$$ |

**Simple Matching and Jaccard Coefficients**

• Simple matching coefficient $$= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)$$.
• Jaccard coefficient $$= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)$$.
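
Both coefficients can be computed directly from two binary vectors. Here is a minimal Python sketch (the function name is ours); note that the Jaccard coefficient ignores 0-0 matches, which makes it better suited to sparse, asymmetric binary data:

```python
def binary_similarity(p, q):
    # count the four cells of the 2 x 2 frequency table
    n11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    n10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    n01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    n00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    smc = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarity(p, q))  # (0.7, 0.0)
```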

## Try it!

Calculate the answer to the question and then check it against the solution below.

Given data:

• p = 1 0 0 0 0 0 0 0 0 0
• q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

|       | q = 1 | q = 0 |
| ----- | ----- | ----- |
| p = 1 | 0     | 1     |
| p = 0 | 2     | 7     |

Calculate the Simple matching coefficient and the Jaccard coefficient.

• Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
• Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
Last Updated: 05/06/2023