### Similarity and Dissimilarity

Distance and similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the name suggests, a similarity measure quantifies how alike two distributions are. For multivariate data, more elaborate summary methods have been developed to answer this question.

- Similarity Measure
  - Numerical measure of how alike two data objects are; often falls between 0 (no similarity) and 1 (complete similarity).

- Dissimilarity Measure
  - Numerical measure of how different two data objects are; ranges from 0 (objects are alike) to \(\infty\) (objects are completely different).

- Proximity
  - Refers to either a similarity or a dissimilarity.

### Similarity/Dissimilarity for Simple Attributes

Here, *p* and *q* are the attribute values for two data objects.

Attribute Type | Similarity | Dissimilarity
---|---|---
Nominal | \(s=\begin{cases} 1 & \text{ if } p=q \\ 0 & \text{ if } p\neq q \end{cases}\) | \(d=\begin{cases} 0 & \text{ if } p=q \\ 1 & \text{ if } p\neq q \end{cases}\)
Ordinal | \(s=1-\dfrac{\lvert p-q \rvert}{n-1}\) (values mapped to integers 0 to \(n-1\), where \(n\) is the number of values) | \(d=\dfrac{\lvert p-q \rvert}{n-1}\)
Interval or Ratio | \(s=1-\lvert p-q \rvert\) or \(s=\dfrac{1}{1+\lvert p-q \rvert}\) | \(d=\lvert p-q \rvert\)

**Distance**, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:

- \(d(p, q) \geq 0\) for all \(p\) and \(q\), and \(d(p, q) = 0\) if and only if \(p = q\) (positive definiteness),
- \(d(p, q) = d(q, p)\) for all \(p\) and \(q\) (symmetry),
- \(d(p, r) \leq d(p, q) + d(q, r)\) for all \(p\), \(q\), and \(r\) (triangle inequality),

where \(d(p, q)\) is the distance (dissimilarity) between points (data objects) \(p\) and \(q\).

A distance that satisfies these properties is called a **metric**. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

### Euclidean Distance

Assume that we have measurements \(x_{ik}\), \(i = 1, \ldots, N\), on variables \(k = 1, \dots, p\) (also called attributes).

The Euclidean distance between the *i*th and *j*th objects is

\(d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)

for every pair (i, j) of observations.

The weighted Euclidean distance is

\(d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)

where \(W_k\) is the weight assigned to the \(k\)th attribute.

If scales of the attributes differ substantially, standardization is necessary.
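As an illustration, both distances can be sketched directly with numpy; this is a minimal example, not part of the original material, and the data values and weight vector below are purely illustrative:

```python
import numpy as np

def euclidean(x, y):
    """Plain Euclidean distance between two observation vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; w holds one weight per attribute."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))

x1 = [1, 3, 1, 2, 4]
x2 = [1, 2, 1, 2, 1]
print(euclidean(x1, x2))                              # sqrt(10) ≈ 3.162
print(weighted_euclidean(x1, x2, [1, 1, 1, 1, 0.5]))  # down-weights the last attribute
```

Setting every weight to 1 recovers the plain Euclidean distance; weights of \(1/s_k^2\) (the reciprocal of each attribute's variance) give one common form of standardization.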

### Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With the measurements \(x_{ik}\), \(i = 1, \dots, N\), \(k = 1, \dots, p\), the Minkowski distance is

\(d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}\)

where \(\lambda \geq 1\). It is also called the \(L_\lambda\) metric.

- \(\lambda = 1\): \(L_1\) metric, Manhattan or city-block distance.
- \(\lambda = 2\): \(L_2\) metric, Euclidean distance.
- \(\lambda \rightarrow \infty\): \(L_\infty\) metric, supremum distance.

\(\lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \)

Note that \(\lambda\) and \(p\) are two different parameters: \(\lambda\) is the order of the metric, while \(p\) is the number of attributes (the dimension of the data matrix), which remains finite.
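The three special cases can be sketched in a single function; a minimal Python example assuming numpy, where `lam = np.inf` selects the supremum distance (the data values are illustrative):

```python
import numpy as np

def minkowski(x, y, lam):
    """Minkowski (L_lambda) distance; lam = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(lam):
        return diff.max()          # limiting case: largest coordinate difference
    return (diff ** lam).sum() ** (1.0 / lam)

x1, x2 = [1, 3, 1, 2, 4], [1, 2, 1, 2, 1]
print(minkowski(x1, x2, 1))        # Manhattan: 4.0
print(minkowski(x1, x2, 2))        # Euclidean: ~3.162
print(minkowski(x1, x2, np.inf))   # supremum: 3.0
```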

#### Mahalanobis Distance

Let **X** be an \(N \times p\) matrix. Then the \(i^{th}\) row of **X** is

\(x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\)

The Mahalanobis distance is

\(d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\)

where \(\Sigma\) is the \(p \times p\) sample covariance matrix.
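A minimal Python sketch of this formula, assuming numpy and estimating the sample covariance from the data matrix itself (the observation values are illustrative):

```python
import numpy as np

def mahalanobis(xi, xj, cov):
    """Mahalanobis distance between two observations, given a p x p covariance matrix."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Illustrative data: N = 3 observations on p = 2 variables
X = np.array([[2.0, 3.0], [10.0, 7.0], [3.0, 2.0]])
S = np.cov(X, rowvar=False)        # sample covariance matrix (rows = observations)
print(mahalanobis(X[0], X[1], S))  # 2.0
```

Unlike the Euclidean distance, this measure accounts for the variances of, and correlations between, the attributes.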

## Try it! Section

Calculate the answers to these questions by yourself and then click the icon on the left to reveal the answer. Consider three objects with attribute values \(x_1 = (1, 3, 1, 2, 4)\), \(x_2 = (1, 2, 1, 2, 1)\), and \(x_3 = (2, 2, 2, 2, 2)\).

- Calculate the Euclidean distances.
- Calculate the Minkowski distances (\(\lambda = 1 \text{ and } \lambda \rightarrow \infty\) cases).

- Euclidean distances are:
\(d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\)

\(d _ { E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\)

\(d _ { E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\)

- Minkowski distances (when \(\lambda = 1\) ) are:
\(d _ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\)

\(d _ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\)

\(d _ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\)

Minkowski distances (when \(\lambda \rightarrow \infty\)) are:

\(d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\)

\(d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\)
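The worked answers above can be checked numerically; a minimal Python sketch assuming numpy, with the object values taken from the calculations above:

```python
import numpy as np

x1 = np.array([1, 3, 1, 2, 4], dtype=float)
x2 = np.array([1, 2, 1, 2, 1], dtype=float)
x3 = np.array([2, 2, 2, 2, 2], dtype=float)

def dists(a, b):
    """Return (Euclidean, L1, L-infinity) distances for one pair of objects."""
    diff = np.abs(a - b)
    return np.sqrt((diff ** 2).sum()), diff.sum(), diff.max()

print(dists(x1, x2))  # (3.162..., 4.0, 3.0)
print(dists(x1, x3))  # (2.645..., 5.0, 2.0)
print(dists(x2, x3))  # (1.732..., 3.0, 1.0)
```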

Consider three two-dimensional objects \(x_1 = (2, 3)\), \(x_2 = (10, 7)\), and \(x_3 = (3, 2)\).

- Calculate the Minkowski distances (\(\lambda = 1\), \(\lambda = 2\), and \(\lambda \rightarrow \infty\) cases) between the first and second objects.
- Calculate the Mahalanobis distance between the first and second objects.

- Minkowski distances are:
\(\lambda = 1\): \(d_M(1, 2) = |2 - 10| + |3 - 7| = 12\)

\(\lambda = 2\): \(d_M(1, 2) = d_E(1, 2) = \left( (2 - 10)^2 + (3 - 7)^2 \right)^{1/2} = 8.944\)

\(\lambda \rightarrow \infty\): \(d_M(1, 2) = \max(|2 - 10|, |3 - 7|) = 8\)

- Since \(\Sigma = \left( \begin{array}{ll} 19 & 11 \\ 11 & 7 \end{array} \right)\), we have \(\Sigma^{-1} = \left( \begin{array}{cc} 7/12 & -11/12 \\ -11/12 & 19/12 \end{array} \right)\), and the Mahalanobis distance is \(d_{MH}(1, 2) = 2\).

R code for the Mahalanobis distance:

```r
# Mahalanobis distance calculation
d1 = c(2, 3)   # each observation
d2 = c(10, 7)
d3 = c(3, 2)

# Get covariance matrix using "ALL" observations
cov_all = cov(rbind(d1, d2, d3))
cov_all

# Inverse covariance matrix is given by:
solve(cov_all)

# Mahalanobis distance is given by:
Mahalanobis_dist = sqrt(matrix(d1 - d2, 1, 2) %*% solve(cov_all) %*% matrix(d1 - d2, 2, 1))
Mahalanobis_dist
```

### Common Properties of Similarity Measures

Similarities have some well-known properties:

- \(s(p, q) = 1\) (or maximum similarity) only if \(p = q\),
- \(s(p, q) = s(q, p)\) for all \(p\) and \(q\),

where \(s(p, q)\) is the similarity between data objects \(p\) and \(q\).

### Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

| | q = 1 | q = 0 |
|---|---|---|
| p = 1 | \(n_{1,1}\) | \(n_{1,0}\) |
| p = 0 | \(n_{0,1}\) | \(n_{0,0}\) |

Here \(n_{i,j}\) is the number of attributes for which \(p\) takes value \(i\) and \(q\) takes value \(j\).

**Simple Matching and Jaccard Coefficients**

- Simple matching coefficient \(= \left( n _ { 1,1 } + n _ { 0,0 } \right) / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } + n _ { 0,0 } \right)\).
- Jaccard coefficient \(= n _ { 1,1 } / \left( n _ { 1,1 } + n _ { 1,0 } + n _ { 0,1 } \right)\).
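Both coefficients are easy to compute by tallying the four cell counts; a minimal Python sketch (the input vectors are hypothetical):

```python
def binary_similarity(p, q):
    """Return (simple matching coefficient, Jaccard coefficient) for two binary vectors."""
    n11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    smc = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01) if (n11 + n10 + n01) else 0.0
    return smc, jaccard

print(binary_similarity([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.333...)
```

The Jaccard coefficient ignores the \(n_{0,0}\) cell, which makes it preferable for sparse binary data where joint absences carry little information.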

## Try it! Section

Calculate the answers to the question and then click the icon on the left to reveal the answer.

Given data:

- p = 1 0 0 0 0 0 0 0 0 0
- q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

| | q = 1 | q = 0 |
|---|---|---|
| p = 1 | 0 | 1 |
| p = 0 | 2 | 7 |

Calculate the Simple matching coefficient and the Jaccard coefficient.

- Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
- Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
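As a numerical check of this answer, a minimal Python sketch using the *p* and *q* vectors given above:

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Tally the four cells of the frequency table
n11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
n10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
n01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
n00 = sum(a == 0 and b == 0 for a, b in zip(p, q))

print((n11 + n00) / (n11 + n10 + n01 + n00))  # simple matching: 0.7
print(n11 / (n11 + n10 + n01))                # Jaccard: 0.0
```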