...
9 January 2024

Dissimilarity indices and identities

by markolenik

The two most common dissimilarity indices used in microbiome research are the Bray-Curtis and the Jaccard indices. Both can be applied to either abundance data (non-negative vectors) or presence/absence data (binary vectors). The Bray-Curtis index is more commonly used for non-negative vectors and the Jaccard index for binary vectors, but here I’ll consider only binary vectors. In R both are available through the vegdist function of the vegan package:

vegdist(x, method="bray", binary=FALSE, diag=FALSE, upper=FALSE, 
        na.rm = FALSE, ...) 

The documentation for the vegdist function is cramped and poorly structured, like most R docs. But it contains some interesting identities that I’d like to derive here.

Note on terminology

The terminology around ecological dissimilarities is all over the place. Very often the terms “metric”, “distance”, and “dissimilarity” are used interchangeably, e.g. Kers et al. (2021) write both “Bray-Curtis metric” and “Bray-Curtis dissimilarity”. Only the latter is correct, since the Bray-Curtis dissimilarity does not satisfy the triangle inequality and is therefore not a metric (see e.g. mathworld). The terms “coefficient” and “index” are also used in a hand-wavy way. Throughout the documentation of vegdist() the authors use “index” to mean either “dissimilarity” or “distance”. Wikipedia uses “index” to mean “coefficient” or “similarity”, e.g. in the Jaccard index, where

$$\text{dissimilarity} = 1 - \text{similarity}$$
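To make the triangle-inequality claim concrete, here is a minimal Python sketch (not from the vegan package) that evaluates the binary Bray-Curtis dissimilarity, in the form defined in the next section, on three hand-picked presence/absence vectors. The function name `bray_curtis` and the vectors `x`, `y`, `z` are illustrative assumptions.

```python
from fractions import Fraction


def bray_curtis(x, y):
    """Binary Bray-Curtis dissimilarity: 1 - 2 * C_xy / (S_x + S_y)."""
    common = sum(min(a, b) for a, b in zip(x, y))  # shared species
    total = sum(x) + sum(y)                        # S_x + S_y
    return 1 - Fraction(2 * common, total)


# Three hypothetical samples over two species.
x = (1, 0)
y = (0, 1)
z = (1, 1)

d_xy = bray_curtis(x, y)  # no species in common
d_xz = bray_curtis(x, z)
d_zy = bray_curtis(z, y)

# A metric would require d(x, y) <= d(x, z) + d(z, y).
triangle_holds = d_xy <= d_xz + d_zy
print(d_xy, d_xz + d_zy, triangle_holds)
```

Going from `x` to `y` via `z` is "shorter" than going directly, so this single triple is enough to show that the triangle inequality fails.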

Bray-Curtis dissimilarity

The Bray-Curtis dissimilarity $d_B$ for binary vectors $\bold{x}, \bold{y} \in \{0, 1\}^N$ is informally (e.g. in wikipedia) defined as

$$d_B(\bold x, \bold y) := 1 - \frac{2 C_{xy}}{S_x + S_y}$$

where $C_{xy}$ is the number of common species between $\bold x$ and $\bold y$, and $S_x$ and $S_y$ are the total number of species in $\bold x$ and $\bold y$, respectively. A more explicit definition is

$$d_B(\bold{x}, \bold{y}) := 1 - \frac{2 \sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i(x_i + y_i)} \overset{(1)}{=} \frac{\sum\limits_i |x_i - y_i|}{\sum\limits_i (x_i + y_i)}$$
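As a quick sanity check before the proof, consider two hypothetical samples over four species, $\bold x = (1, 1, 0, 1)$ and $\bold y = (1, 0, 1, 1)$ (the vectors are chosen only for illustration). They share $C_{xy} = 2$ species and contain $S_x = S_y = 3$ species each, so both forms give the same value:

$$
\begin{aligned}
1 - \frac{2 \sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i (x_i + y_i)} &= 1 - \frac{2 \cdot 2}{3 + 3} = \frac{1}{3}\\
\frac{\sum\limits_i |x_i - y_i|}{\sum\limits_i (x_i + y_i)} &= \frac{0 + 1 + 1 + 0}{6} = \frac{1}{3}
\end{aligned}
$$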

The identity $(1)$ can be shown using the following property of the absolute value function:

$$|x - y| = \max{(x, y)} - \min{(x, y)}$$

Then

$$
\begin{aligned}
1 - \frac{2 \sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i(x_i + y_i)} &= \frac{\sum\limits_i \big(x_i + y_i - 2 \min{(x_i, y_i)}\big)}{\sum\limits_i(x_i + y_i)}\\
&= \frac{\sum\limits_i \big(x_i + y_i - 2 \min{(x_i, y_i)} + \max{(x_i, y_i)} - \max{(x_i, y_i)}\big)}{\sum\limits_i(x_i + y_i)}\\
&= \frac{\sum\limits_i \big(x_i + y_i - \max{(x_i, y_i)} - \min{(x_i, y_i)} + |x_i - y_i|\big)}{\sum\limits_i(x_i + y_i)}\\
&= \frac{\sum\limits_i |x_i - y_i|}{\sum\limits_i(x_i + y_i)}\; \blacksquare
\end{aligned}
$$

The last step follows from the fact that

$$\max{(x_i, y_i)} + \min{(x_i, y_i)} = x_i + y_i$$
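The identity can also be checked numerically. The following is a minimal Python sketch (not part of vegan) that compares the min-based and the absolute-difference forms of the Bray-Curtis dissimilarity on every pair of binary vectors of length 4. The function names and the choice `N = 4` are assumptions made for illustration; pairs where both vectors are all zero are skipped because the denominator vanishes.

```python
from fractions import Fraction
from itertools import product


def bray_curtis_min(x, y):
    """1 - 2 * sum(min(x_i, y_i)) / sum(x_i + y_i)."""
    num = 2 * sum(min(a, b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return 1 - Fraction(num, den)


def bray_curtis_abs(x, y):
    """sum(|x_i - y_i|) / sum(x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return Fraction(num, den)


N = 4  # illustrative vector length
vectors = list(product((0, 1), repeat=N))

# Skip the single pair of two empty samples (zero denominator).
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]

mismatches = [
    (x, y) for x, y in pairs
    if bray_curtis_min(x, y) != bray_curtis_abs(x, y)
]
print(len(pairs), len(mismatches))
```

Exact rational arithmetic via `Fraction` avoids floating-point noise, so an empty `mismatches` list means the two forms agree exactly on all tested pairs. This is a check on small cases, not a substitute for the proof above.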

Jaccard distance

The Jaccard distance is defined as

$$d_J(\bold x, \bold y) := 1 - \frac{\sum\limits_i \min{(x_i, y_i)}}{\sum\limits_i\max{(x_i, y_i)}}$$

According to vegdist it can be expressed in terms of the Bray-Curtis dissimilarity as

$$d_J(\bold x, \bold y) = \frac{2\; d_B(\bold x, \bold y)}{1 + d_B(\bold x, \bold y)}$$

Substituting the definition of $d_B$ and simplifying we get

$$d_J(\bold x, \bold y) = \frac{2 \sum \limits_i |x_i - y_i|}{\sum \limits_i (x_i + y_i) + \sum \limits_i |x_i - y_i|}$$

To prove this identity we start by simplifying the numerator as follows

$$2 \sum_i |x_i - y_i| = 2 \sum_i \max{(x_i, y_i)} - 2 \sum_i \min{(x_i, y_i)}$$

We simplify the denominator as follows

$$
\begin{aligned}
\sum \limits_i (x_i + y_i) + \sum \limits_i |x_i - y_i| &= \sum \limits_i \big( x_i + y_i + \max{(x_i, y_i)} - \min{(x_i, y_i)} \big)\\[1.5em]
&= \sum \limits_i \big( x_i + y_i + \max{(x_i, y_i)} - \min{(x_i, y_i)} + \max{(x_i, y_i)} - \max{(x_i, y_i)} \big)\\[1.5em]
&= 2\sum \limits_i \max{(x_i, y_i)}
\end{aligned}
$$

Putting everything together we get

$$\frac{2 \sum \limits_i \max{(x_i, y_i)} - 2 \sum \limits_i \min{(x_i, y_i)}}{2\sum \limits_i \max{(x_i, y_i)}} = 1 - \frac{\sum \limits_i \min{(x_i, y_i)}}{\sum \limits_i \max{(x_i, y_i)}} \; \blacksquare$$

thus proving the identity.
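A small numerical check of this relation, written as a Python sketch rather than with vegdist itself: it evaluates both sides exactly on all non-empty pairs of binary vectors of length 4. The function names, the length `N = 4`, and the example vectors are assumptions for illustration only.

```python
from fractions import Fraction
from itertools import product


def bray_curtis(x, y):
    """Binary Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    return Fraction(
        sum(abs(a - b) for a, b in zip(x, y)),
        sum(a + b for a, b in zip(x, y)),
    )


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    return 1 - Fraction(
        sum(min(a, b) for a, b in zip(x, y)),
        sum(max(a, b) for a, b in zip(x, y)),
    )


N = 4  # illustrative vector length
vectors = list(product((0, 1), repeat=N))

# Compare d_J with 2 * d_B / (1 + d_B), skipping the all-zero pair.
identity_holds = all(
    jaccard(x, y) == 2 * bray_curtis(x, y) / (1 + bray_curtis(x, y))
    for x in vectors
    for y in vectors
    if any(x) or any(y)
)
print(identity_holds)

# One concrete pair: d_B = 1/3 maps to d_J = 1/2.
x, y = (1, 1, 0, 1), (1, 0, 1, 1)
print(bray_curtis(x, y), jaccard(x, y))
```

Since $2d/(1+d)$ is increasing in $d$ on $[0, 1]$, the two dissimilarities rank pairs of samples in the same order even though their values differ.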

Tanimoto distance

The Tanimoto distance, also called “generalized Jaccard distance”, is another common distance found in theoretical ecology. It is defined using dot products as

$$d_T(\bold x, \bold y) := 1 - \frac{\bold x \cdot \bold y}{\bold x \cdot \bold x + \bold y \cdot \bold y - \bold x \cdot \bold y}$$

The Tanimoto distance is equal to the Jaccard distance for binary vectors:

$$
\begin{aligned}
d_J(\bold x, \bold y) &= d_T(\bold x, \bold y)\\[1.5em]
1 - \frac{\sum \min(x_i, y_i)}{\sum \max(x_i, y_i)} &= 1 - \frac{\bold x \cdot \bold y}{\bold x \cdot \bold x + \bold y \cdot \bold y - \bold x \cdot \bold y}
\end{aligned}
$$

This identity follows from the fact that for binary vectors the dot product can be simplified to

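$$
\begin{aligned}
\bold x \cdot \bold y &= \sum x_i y_i = \sum \min{(x_i, y_i)}\\
\bold x \cdot \bold x &= \sum x_i^2 = \sum x_i\\
\bold y \cdot \bold y &= \sum y_i^2 = \sum y_i
\end{aligned}
$$

since $x_i y_i = \min{(x_i, y_i)}$ and $x_i^2 = x_i$ whenever $x_i, y_i \in \{0, 1\}$.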

Simplifying $d_T$ we get

$$
\begin{aligned}
d_T(\mathbf{x}, \mathbf{y}) &= 1 -\frac{\sum \min(x_i, y_i)}{\sum x_i + \sum y_i - \sum \min(x_i, y_i)} \\[1.5em]
&= 1 - \frac{\sum \min(x_i, y_i)}{\sum \big( x_i + y_i - \min(x_i, y_i) + \max(x_i, y_i) - \max(x_i, y_i)\big)} \\[1.5em]
&= 1 - \frac{\sum \min(x_i, y_i)}{\sum \max(x_i, y_i)} = d_J(\mathbf{x}, \mathbf{y}) \; \blacksquare
\end{aligned}
$$
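The restriction to binary vectors matters here. As a rough illustration, the Python sketch below checks the equality on all non-empty pairs of binary vectors of length 4 and then evaluates both distances on one non-binary pair, where they differ. The function names, `N = 4`, and the abundance values `(2,)` and `(1,)` are hypothetical choices, not taken from any dataset.

```python
from fractions import Fraction
from itertools import product


def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto(x, y):
    """Tanimoto distance: 1 - x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    return 1 - Fraction(xy, dot(x, x) + dot(y, y) - xy)


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    return 1 - Fraction(
        sum(min(a, b) for a, b in zip(x, y)),
        sum(max(a, b) for a, b in zip(x, y)),
    )


N = 4  # illustrative vector length
vectors = list(product((0, 1), repeat=N))

# On binary vectors the two distances coincide (all-zero pair skipped).
equal_on_binary = all(
    tanimoto(x, y) == jaccard(x, y)
    for x in vectors
    for y in vectors
    if any(x) or any(y)
)
print(equal_on_binary)

# With abundances instead of presence/absence they can disagree.
a, b = (2,), (1,)
print(jaccard(a, b), tanimoto(a, b))
```

The disagreement on `a` and `b` comes from the step $x_i^2 = x_i$, which only holds for entries in $\{0, 1\}$.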

References