9 January 2024

Dissimilarity indices and identities

by markolenik

Two common dissimilarity indices used in microbiome research are the Bray-Curtis and the Jaccard indices. Both can be applied to either abundance data (non-negative vectors) or presence/absence data (binary vectors). The Bray-Curtis index is more commonly used for non-negative vectors and the Jaccard index for binary vectors; here, however, I’ll consider only binary vectors. In R they are implemented in the vegan package as

vegdist(x, method="bray", binary=FALSE, diag=FALSE, upper=FALSE, 
        na.rm = FALSE, ...) 

The documentation for the vegdist function is cramped and poorly structured, like most R docs. But it contains some interesting identities that I’d like to derive here.

Note on terminology

The terminology in the context of ecological dissimilarities is all over the place. Very often the terms “metric”, “distance”, and “dissimilarity” are used interchangeably, e.g. Kers et al. (2021) write both “Bray-Curtis metric” and “Bray-Curtis dissimilarity”. Only the latter is correct, since the Bray-Curtis dissimilarity does not satisfy the triangle inequality and is therefore not a metric (e.g. see mathworld). The terms “coefficient” and “index” are also used in a hand-wavy way. Throughout the documentation of vegdist() the authors use “index” to mean either “dissimilarity” or “distance”. Wikipedia uses “index” to mean “coefficient” or “similarity”, e.g. in the Jaccard index, where

$\text{dissimilarity} = 1 - \text{similarity}$
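The triangle-inequality claim is easy to check on a toy example. Below is a minimal sketch in plain Python (not vegan), using three presence/absence vectors that I picked for illustration. The helper names `bray_curtis` and `jaccard` are my own, and they use the absolute-difference and min/max formulas defined in the sections that follow.

```python
from fractions import Fraction


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    diff = sum(abs(a - b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return Fraction(diff, total)


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    union = sum(max(a, b) for a, b in zip(x, y))
    return 1 - Fraction(shared, union)


# Illustrative samples: x and y share no species, z contains both.
x = (1, 0)
y = (0, 1)
z = (1, 1)

# Bray-Curtis: the direct path is longer than the detour via z.
print(bray_curtis(x, y), bray_curtis(x, z) + bray_curtis(z, y))  # 1 2/3

# Jaccard: the triangle inequality holds (with equality here).
print(jaccard(x, y), jaccard(x, z) + jaccard(z, y))  # 1 1
```

Exact fractions are used instead of floats so that the comparison is not blurred by rounding.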

Bray-Curtis dissimilarity

The Bray-Curtis dissimilarity $d_B$ for binary vectors $\bold{x}, \bold{y} \in \{0, 1\}^N$ is informally (e.g. on Wikipedia) defined as

$d_B(\bold x, \bold y) := 1 - \frac{2 C_{xy}}{S_x + S_y}$

where $C_{xy}$ is the number of species common to $\bold x$ and $\bold y$, and $S_x$ and $S_y$ are the total numbers of species in $\bold x$ and $\bold y$, respectively. A more explicit definition is

$d_B(\bold{x}, \bold{y}) := 1 - \frac{2 \sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i(x_i + y_i)} \overset{(1)}{=} \frac{\sum\limits_i |x_i - y_i|}{\sum\limits_i (x_i + y_i)}$
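As a quick sanity check on a pair of vectors made up for illustration, take $\bold x = (1, 1, 0, 1)$ and $\bold y = (1, 0, 1, 1)$. They share two species, so $C_{xy} = 2$, and $S_x = S_y = 3$. Both forms give the same value:

$d_B(\bold x, \bold y) = 1 - \frac{2 \cdot 2}{3 + 3} = \frac{1}{3}, \qquad \frac{\sum\limits_i |x_i - y_i|}{\sum\limits_i (x_i + y_i)} = \frac{0 + 1 + 1 + 0}{6} = \frac{1}{3}$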

The identity $(1)$ can be shown using the properties of the absolute value function:

$|x - y| = \max{(x, y)} - \min{(x, y)}$

Then

$\begin{aligned} 1 - \frac{2 \sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i(x_i + y_i)} & = \frac{\sum\limits_i \big(x_i + y_i- 2 \min{(x_i, y_i)}\big)}{\sum\limits_i(x_i + y_i)}\\ &= \frac{\sum\limits_i \big(x_i + y_i- 2 \min{(x_i, y_i)} + \max{(x_i, y_i)} - \max{(x_i, y_i)}\big)}{\sum\limits_i(x_i + y_i)} \\ &= \frac{\sum\limits_i \big(x_i + y_i- \max{(x_i, y_i)} - \min{(x_i, y_i)} + |x_i - y_i| \big)}{\sum\limits_i(x_i + y_i)}\\ &= \frac{\sum\limits_i \big(|x_i - y_i| \big)}{\sum\limits_i(x_i + y_i)}\; \blacksquare \end{aligned}$

The last step follows from the fact that

$\max{(x_i, y_i)} + \min{(x_i, y_i)} = x_i + y_i$
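Identity $(1)$ can also be checked by brute force. The sketch below, in plain Python with function names of my own choosing, compares the two forms on every pair of binary vectors of length 4 (an arbitrary choice), skipping the one pair where both vectors are all zeros and the denominator vanishes.

```python
from fractions import Fraction
from itertools import product


def bray_curtis_min(x, y):
    """Left-hand side of (1): 1 - 2 * sum(min) / sum(x_i + y_i)."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return 1 - Fraction(2 * shared, total)


def bray_curtis_abs(x, y):
    """Right-hand side of (1): sum|x_i - y_i| / sum(x_i + y_i)."""
    diff = sum(abs(a - b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return Fraction(diff, total)


# All binary vectors of length 4; drop the pair of two empty samples,
# for which the dissimilarity is undefined (0 / 0).
vectors = list(product((0, 1), repeat=4))
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]

all_equal = all(bray_curtis_min(x, y) == bray_curtis_abs(x, y) for x, y in pairs)
print(all_equal)  # True
```

This is of course no substitute for the proof, but it catches sign errors quickly.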

Jaccard distance

The Jaccard distance is defined as

$d_J(\bold x, \bold y) := 1 - \frac{\sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i\max{(x_i, y_i)}}$

According to vegdist it can be expressed in terms of the Bray-Curtis dissimilarity as

$d_J(\bold x, \bold y) = \frac{2\; d_B(\bold x, \bold y)}{1 + d_B(\bold x, \bold y)}$

Substituting the definition of $d_B$ and simplifying we get

$d_J(\bold x, \bold y) = \frac{2 \sum \limits_i |x_i - y_i|}{\sum \limits_i (x_i + y_i) + \sum \limits_i |x_i - y_i|}$

To show that this expression equals the definition of $d_J$, we start by rewriting the numerator using $|x - y| = \max{(x, y)} - \min{(x, y)}$

$2 \sum_i |x_i - y_i| = 2 \sum_i \max{(x_i, y_i)} - 2 \sum_i \min{(x_i, y_i)}$

We simplify the denominator as follows, using $\max{(x_i, y_i)} + \min{(x_i, y_i)} = x_i + y_i$ in the last step

$\begin{aligned} &\sum \limits_i (x_i + y_i) + \sum \limits_i |x_i - y_i| =\sum \limits_i \big( x_i + y_i + \max{(x_i, y_i)} - \min{(x_i, y_i)} \big) =\\[1.5em] &\sum \limits_i \big( x_i + y_i + \max{(x_i, y_i)} - \min{(x_i, y_i)} + \max{(x_i, y_i)} - \max{(x_i, y_i)} \big) =\\[1.5em] &2\sum \limits_i \max{(x_i, y_i)} \end{aligned}$

Putting everything together we get

$\frac{2 \sum \limits_i \max{(x_i, y_i)} - 2 \sum \limits_i \min{(x_i, y_i)}}{ 2\sum \limits_i \max{(x_i, y_i)} } = 1 - \frac{\sum \limits_i \min{(x_i, y_i)}}{\sum \limits_i \max{(x_i, y_i)}} \; \blacksquare$

thus proving the identity.
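The relation between the two indices can be checked numerically as well. The following is a minimal sketch in plain Python, not the vegan implementation; the helper names and the abundance example are my own. Since the proof never uses that the entries are binary, the identity should also hold for non-negative abundance vectors, which the last line illustrates on one hand-picked pair.

```python
from fractions import Fraction
from itertools import product


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    diff = sum(abs(a - b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return Fraction(diff, total)


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    union = sum(max(a, b) for a, b in zip(x, y))
    return 1 - Fraction(shared, union)


def jaccard_from_bray(x, y):
    """Jaccard distance computed as 2 * d_B / (1 + d_B)."""
    d_b = bray_curtis(x, y)
    return 2 * d_b / (1 + d_b)


# Every pair of binary vectors of length 4, except two empty samples.
vectors = list(product((0, 1), repeat=4))
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]
identity_holds = all(jaccard(x, y) == jaccard_from_bray(x, y) for x, y in pairs)
print(identity_holds)  # True

# One abundance example (made-up counts): both routes give 2/3.
u, v = (3, 0, 2), (1, 4, 2)
print(jaccard(u, v), jaccard_from_bray(u, v))  # 2/3 2/3
```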

Tanimoto distance

The Tanimoto distance, also called “generalized Jaccard distance”, is another common distance found in theoretical ecology. It is defined using dot products as

$d_T(\bold x, \bold y) := 1 - \frac{\bold x \cdot \bold y}{\bold x \cdot \bold x + \bold y \cdot \bold y - \bold x \cdot \bold y}$

The Tanimoto distance is equal to the Jaccard distance for binary vectors:

$\begin{aligned} d_J(\bold x, \bold y) &= d_T(\bold x, \bold y)\\[1.5em] 1 - \frac{\sum \min(x_i, y_i)}{\sum \max(x_i, y_i)} &= 1 - \frac{\bold x \cdot \bold y}{\bold x \cdot \bold x + \bold y \cdot \bold y - \bold x \cdot \bold y} \end{aligned}$

This identity follows from the fact that for binary vectors, where $x_i y_i = \min{(x_i, y_i)}$ and $x_i^2 = x_i$, the dot products simplify to

$\begin{aligned} \bold x \cdot \bold y &= \sum \min{(x_i, y_i)}\\ \bold x \cdot \bold x &= \sum x_i\\ \bold y \cdot \bold y &= \sum y_i \end{aligned}$

Simplifying $d_T$ we get

$\begin{aligned} d_T(\bold x, \bold y) &= 1 -\frac{\sum \min(x_i, y_i)}{\sum x_i + \sum y_i - \sum \min(x_i, y_i)} \\[1.5em] &=1 - \frac{\sum \min(x_i, y_i)}{\sum \big( x_i + y_i - \min(x_i, y_i) + \max(x_i, y_i) - \max(x_i, y_i)\big )} \\[1.5em] &=1 - \frac{\sum \min(x_i, y_i)}{\sum \max(x_i, y_i)} = d_J(\bold x, \bold y) \; \blacksquare \end{aligned}$
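The restriction to binary vectors matters here. The sketch below (plain Python, helper names my own) confirms that the Tanimoto and Jaccard distances agree on all pairs of binary vectors of length 4, and then shows a one-dimensional abundance example, chosen for illustration, where they differ.

```python
from fractions import Fraction
from itertools import product


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    union = sum(max(a, b) for a, b in zip(x, y))
    return 1 - Fraction(shared, union)


def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto(x, y):
    """Tanimoto distance: 1 - x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    return 1 - Fraction(xy, dot(x, x) + dot(y, y) - xy)


# Binary vectors: the two distances coincide on every admissible pair.
vectors = list(product((0, 1), repeat=4))
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]
same_on_binary = all(jaccard(x, y) == tanimoto(x, y) for x, y in pairs)
print(same_on_binary)  # True

# Abundance vectors: x_i * y_i is no longer min(x_i, y_i), so they differ.
u, v = (2,), (1,)
print(jaccard(u, v), tanimoto(u, v))  # 1/2 1/3
```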
