1(b).2.1: Measures of Similarity and Dissimilarity | STAT 508 (2023)

Similarity and Dissimilarity

Distance and similarity measures are essential in solving many pattern recognition problems, such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data objects. As the names suggest, a similarity measure quantifies how alike two data objects are, while a dissimilarity measure quantifies how different they are. For multivariate data, more elaborate summary measures are needed.

Similarity Measure
Numerical measure of how alike two data objects are; often falls between 0 (no similarity) and 1 (complete similarity)
Dissimilarity Measure
Numerical measure of how different two data objects are; ranges from 0 (objects are alike) to \(\infty\) (objects are completely different)
Proximity
refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

Attribute Type: Nominal

Similarity: \(s=\begin{cases}
1 & \text{ if } p=q \\
0 & \text{ if } p\neq q
\end{cases}\)

Dissimilarity: \(d=\begin{cases}
0 & \text{ if } p=q \\
1 & \text{ if } p\neq q
\end{cases}\)

Attribute Type: Ordinal (values mapped to integers 0 to n-1, where n is the number of values)

Similarity: \(s=1-\dfrac{\left | p-q \right |}{n-1}\)

Dissimilarity: \(d=\dfrac{\left | p-q \right |}{n-1}\)

Attribute Type: Interval or Ratio

Similarity: \(s=1-\left | p-q \right |\) or \(s=\dfrac{1}{1+\left | p-q \right |}\)

Dissimilarity: \(d=\left | p-q \right |\)
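As a small illustration, here is a minimal R sketch (the attribute values are made up for this example) computing each pair of measures:

# Nominal attribute: similarity is 1 if the values match, 0 otherwise
p <- "red"; q <- "blue"
s <- as.numeric(p == q)    # similarity: 0
d <- 1 - s                 # dissimilarity: 1

# Ordinal attribute: map values to integers 0..(n-1), then scale the gap
vals   <- c("low", "medium", "high")    # n = 3 ordered values
p_rank <- match("low",  vals) - 1       # 0
q_rank <- match("high", vals) - 1       # 2
n      <- length(vals)
d      <- abs(p_rank - q_rank) / (n - 1)   # 1
s      <- 1 - d                            # 0

# Interval/ratio attribute: dissimilarity is the absolute difference
p <- 3.2; q <- 5.0
d <- abs(p - q)       # 1.8
s <- 1 / (1 + d)      # about 0.357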


Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties.

Common Properties of Dissimilarity Measures

  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
  2. d(p, q) = d(q, p) for all p and q (symmetry),
  3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality),

where d(p, q) is the distance (dissimilarity) between data objects p and q. A quick numerical check of these properties is sketched below.
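The following minimal R sketch (our own example points, not from the original notes) checks all three properties for the Euclidean distance introduced in the next subsection:

# Three random points in R^5
set.seed(1)
p <- runif(5); q <- runif(5); r <- runif(5)
d <- function(a, b) sqrt(sum((a - b)^2))   # Euclidean distance

d(p, q) >= 0                      # positivity: TRUE
d(p, q) == d(q, p)                # symmetry: TRUE
d(p, r) <= d(p, q) + d(q, r)      # triangle inequality: TRUE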

A distance that satisfies these properties is called a metric. Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

Euclidean Distance

Assume that we have measurements \(x_{ik}\), \(i = 1 , \ldots , N\), on variables \(k = 1 , \dots , p\) (also called attributes).

The Euclidean distance between the ith and jth objects is

\(d_E(i, j)=\left(\sum_{k=1}^{p}\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)

for every pair (i, j) of observations.

The weighted Euclidean distance is:

\(d_{WE}(i, j)=\left(\sum_{k=1}^{p}W_k\left(x_{ik}-x_{jk} \right) ^2\right)^\frac{1}{2}\)

If scales of the attributes differ substantially, standardization is necessary.
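In R, pairwise Euclidean distances are available through dist(); the sketch below (the data matrix and weights are made up for illustration) also shows a weighted version and standardization via scale():

# N = 3 objects with p = 4 attributes (values are ours)
X <- rbind(c(0, 2, 1, 5),
           c(3, 1, 0, 2),
           c(4, 4, 2, 0))
dist(X)                 # all pairwise Euclidean distances d_E(i, j)

# Weighted Euclidean distance between objects 1 and 2, with weights w_k
w <- c(1, 0.5, 2, 1)
sqrt(sum(w * (X[1, ] - X[2, ])^2))

# If attribute scales differ substantially, standardize the columns first
dist(scale(X))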

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

With the measurements \(x_{ik}\), \(i = 1, \ldots, N\), \(k = 1, \ldots, p\), the Minkowski distance is

\(d_M(i, j)=\left(\sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right)^\frac{1}{\lambda}\)

where \(\lambda \geq 1\). It is also called the \(L_\lambda\) metric.

  • \(\lambda = 1\): \(L_1\) metric, Manhattan or city-block distance.
  • \(\lambda = 2\): \(L_2\) metric, Euclidean distance.
  • \(\lambda \rightarrow \infty\): \(L_\infty\) metric, supremum distance.

\( \lim_{\lambda \to \infty}\left( \sum_{k=1}^{p}\left | x_{ik}-x_{jk} \right | ^ \lambda \right) ^\frac{1}{\lambda} =\max\left( \left | x_{i1}-x_{j1}\right| , \ldots , \left | x_{ip}-x_{jp}\right| \right) \)

Note that \(\lambda\) and p are two different parameters: \(\lambda\) is the order of the metric, while p is the number of attributes (the dimension of the data matrix), which remains finite.
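In R, dist() covers the whole Minkowski family; beware that its argument named p is the exponent \(\lambda\), not the number of attributes. A minimal sketch with made-up vectors:

x <- c(0, 3, 4)   # two small example vectors (ours)
y <- c(2, 1, 1)
dist(rbind(x, y), method = "minkowski", p = 1)   # L1 (Manhattan): 7
dist(rbind(x, y), method = "minkowski", p = 2)   # L2 (Euclidean): 4.123
dist(rbind(x, y), method = "maximum")            # L-infinity (supremum): 3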

Mahalanobis Distance

Let X be an N × p matrix. Then the \(i^{th}\) row of X is

\(x_{i}^{T}=\left( x_{i1}, ... , x_{ip} \right)\)

The Mahalanobis distance is

\(d_{MH}(i, j)=\left( \left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\right)^\frac{1}{2}\)

where \(\Sigma\) is the \(p \times p\) sample covariance matrix.
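R ships a built-in mahalanobis() function; it returns the squared form \(\left( x_i - x_j\right)^T \Sigma^{-1} \left( x_i - x_j\right)\), so take a square root to obtain \(d_{MH}\). A minimal sketch with made-up data:

X <- rbind(c(0, 1), c(2, 3), c(5, 4), c(3, 0))   # hypothetical N = 4, p = 2 data
S <- cov(X)                                      # p x p sample covariance matrix
# mahalanobis() returns the squared distance, so take the square root
sqrt(mahalanobis(X[1, ], center = X[2, ], cov = S))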

Try it!

Work out the answers yourself, then check them against the solutions below. Consider three objects with attribute vectors \(x_1 = (1, 3, 1, 2, 4)\), \(x_2 = (1, 2, 1, 2, 1)\), and \(x_3 = (2, 2, 2, 2, 2)\).

  1. Calculate the Euclidean distances.
  2. Calculate the Minkowski distances (\(\lambda = 1\) and \(\lambda \rightarrow \infty\) cases).
  1. Euclidean distances are:

    \(d _ { E } ( 1,2 ) = \left( ( 1 - 1 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 1 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 1 ) ^ { 2 } \right) ^ { 1 / 2 } = 3.162\)

    \(d _ { E } ( 1,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 3 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 4 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 2.646\)

    \(d _ { E } ( 2,3 ) = \left( ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } + ( 2 - 2 ) ^ { 2 } + ( 1 - 2 ) ^ { 2 } \right) ^ { 1 / 2 } = 1.732\)

  2. Minkowski distances (when \(\lambda = 1\) ) are:

    \(d _ { M } ( 1,2 ) = | 1 - 1 | + | 3 - 2 | + | 1 - 1 | + | 2 - 2 | + | 4 - 1 | = 4\)

    \(d _ { M } ( 1,3 ) = | 1 - 2 | + | 3 - 2 | + | 1 - 2 | + | 2 - 2 | + | 4 - 2 | = 5\)

    \(d _ { M } ( 2,3 ) = | 1 - 2 | + | 2 - 2 | + | 1 - 2 | + | 2 - 2 | + | 1 - 2 | = 3\)

    Minkowski distances \(( \text { when } \lambda \rightarrow \infty )\) are:

    \(d _ { M } ( 1,2 ) = \max ( | 1 - 1 | , | 3 - 2 | , | 1 - 1 | , | 2 - 2 | , | 4 - 1 | ) = 3\)

    \(d _ { M } ( 1,3 ) = 2 \text { and } d _ { M } ( 2,3 ) = 1\)
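These hand calculations can be checked in R with dist(), using the three objects from the exercise (a quick sketch):

X <- rbind(c(1, 3, 1, 2, 4),
           c(1, 2, 1, 2, 1),
           c(2, 2, 2, 2, 2))
dist(X)                                # Euclidean: 3.162, 2.646, 1.732
dist(X, method = "minkowski", p = 1)   # lambda = 1: 4, 5, 3
dist(X, method = "maximum")            # lambda -> infinity: 3, 2, 1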

Now consider objects \(x_1 = (2, 3)\), \(x_2 = (10, 7)\), and \(x_3 = (3, 2)\).

  1. Calculate the Minkowski distance (\(\lambda = 1\), \(\lambda = 2\), and \(\lambda \rightarrow \infty\) cases) between the first and second objects.
  2. Calculate the Mahalanobis distance between the first and second objects (use all three objects to estimate the covariance matrix).
    1. The Minkowski distances are:

      \(\lambda = 1\): \(d_M(1,2) = | 2 - 10 | + | 3 - 7 | = 12\)

      \(\lambda = 2\): \(d_M(1,2) = d_E(1,2) = \left( ( 2 - 10 )^2 + ( 3 - 7 )^2 \right)^{1/2} = 8.944\)

      \(\lambda \rightarrow \infty\): \(d_M(1,2) = \max ( | 2 - 10 | , | 3 - 7 | ) = 8\)

    2. Since \(\Sigma = \left( \begin{array} { l l } { 19 } & { 11 } \\ { 11 } & { 7 } \end{array} \right)\), we have \(\Sigma ^ { - 1 } = \left( \begin{array} { c c } { 7 / 12 } & { - 11 / 12 } \\ { - 11 / 12 } & { 19 / 12 } \end{array} \right)\). With \(x_1 - x_2 = (-8, -4)^T\), the quadratic form is \((x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2) = 4\), so the Mahalanobis distance is

      \(d_{MH}(1,2) = \sqrt{4} = 2\)

    R code for Mahalanobis distance

    # Mahalanobis distance calculation
    d1 <- c(2, 3)    # each observation is a row
    d2 <- c(10, 7)
    d3 <- c(3, 2)

    # Get the covariance matrix using ALL observations
    cov_all <- cov(rbind(d1, d2, d3))
    cov_all

    # The inverse covariance matrix is given by:
    solve(cov_all)

    # The Mahalanobis distance is given by:
    Mahalanobis_dist <- sqrt(matrix(d1 - d2, 1, 2) %*% solve(cov_all) %*% matrix(d1 - d2, 2, 1))
    Mahalanobis_dist

Common Properties of Similarity Measures

Similarities also have some well-known properties:

  1. s(p, q) = 1 (or maximum similarity) only if p = q,
  2. s(p, q) = s(q, p) for all p and q,

where s(p, q) is the similarity between data objects p and q.

Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary. Counting the matches and mismatches between two binary vectors p and q gives the 2 × 2 table:

\(\begin{array}{c|cc} & q=1 & q=0 \\ \hline p=1 & n_{1,1} & n_{1,0} \\ p=0 & n_{0,1} & n_{0,0} \end{array}\)

Simple Matching and Jaccard Coefficients

  • Simple matching coefficient \(= \left( n_{1,1} + n_{0,0} \right) / \left( n_{1,1} + n_{1,0} + n_{0,1} + n_{0,0} \right)\).
  • Jaccard coefficient \(= n_{1,1} / \left( n_{1,1} + n_{1,0} + n_{0,1} \right)\).
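Both coefficients are one-liners in R; in the following sketch the function names simple_matching and jaccard are our own:

# p, q: binary (0/1) vectors of equal length
simple_matching <- function(p, q) mean(p == q)            # (n11 + n00) / total
jaccard         <- function(p, q) sum(p & q) / sum(p | q) # n11 / (n11 + n10 + n01)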

Try it!

Calculate the answer to the question below, then check it against the solution.

Given data:

  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1

The frequency table is:

\(\begin{array}{c|cc} & q=1 & q=0 \\ \hline p=1 & 0 & 1 \\ p=0 & 2 & 7 \end{array}\)

Calculate the simple matching coefficient and the Jaccard coefficient.

  • Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
  • Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
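As a check, the sketch functions defined above reproduce these values:

p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
table(p, q)             # reproduces the 2 x 2 frequency table
simple_matching(p, q)   # 0.7
jaccard(p, q)           # 0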
