9 January 2024
Dissimilarity indices and identities
by markolenik
Two dissimilarity indices commonly used in microbiome research are the Bray-Curtis and Jaccard indices.
Both can be applied to either abundance data (non-negative vectors) or presence/absence data (binary vectors).
The Bray-Curtis index is more commonly used for non-negative vectors and the Jaccard index for binary vectors. Here, however, I’ll consider only binary vectors.
In R they are implemented in the `vegan` package as
vegdist(x, method="bray", binary=FALSE, diag=FALSE, upper=FALSE,
na.rm = FALSE, ...)
The documentation for the `vegdist` function is cramped and poorly structured, like most R docs.
But it contains some interesting identities that I’d like to derive here.
Note on terminology
The terminology in the context of ecological dissimilarities is all over the place.
Very often the terms “metric”, “distance”, and “dissimilarity” are used interchangeably, e.g. Kers et al. (2021) write both “Bray-Curtis metric” and “Bray-Curtis dissimilarity”.
Only the latter is correct, since the Bray-Curtis dissimilarity does not satisfy the triangle inequality and is therefore not a metric (see e.g. MathWorld).
The terms “coefficient” and “index” are also used in a hand-wavy way.
Throughout the documentation of `vegdist()` the authors use “index” to mean either “dissimilarity” or “distance”.
Wikipedia uses “index” to mean “coefficient” or “similarity”, e.g. in the Jaccard index, where

$$
\text{dissimilarity} = 1 - \text{similarity}
$$
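To make the triangle-inequality point concrete, here is a minimal Python sketch of a counterexample. It uses the binary-vector formulas defined in the sections below; the function names and the three example vectors are my own choices for illustration, not taken from `vegan`.

```python
from fractions import Fraction


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 1 - 2*sum(min) / sum(x_i + y_i)."""
    total = sum(a + b for a, b in zip(x, y))
    if total == 0:
        raise ValueError("undefined when both vectors are all zeros")
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - Fraction(2 * shared, total)


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min) / sum(max)."""
    union = sum(max(a, b) for a, b in zip(x, y))
    if union == 0:
        raise ValueError("undefined when both vectors are all zeros")
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - Fraction(shared, union)


def satisfies_triangle(d, x, y, z):
    """Check d(x, y) <= d(x, z) + d(z, y) for one triple of vectors."""
    return d(x, y) <= d(x, z) + d(z, y)


# Hypothetical example: two samples with no shared species, and a third
# sample that contains both species.
x, y, z = (1, 0), (0, 1), (1, 1)

print(bray_curtis(x, y), bray_curtis(x, z), bray_curtis(z, y))  # 1 1/3 1/3
print(satisfies_triangle(bray_curtis, x, y, z))  # False
print(satisfies_triangle(jaccard, x, y, z))      # True
```

Going from x to y via z costs 2/3 under Bray-Curtis, which is less than the direct dissimilarity of 1, so the triangle inequality fails for this triple. The Jaccard check only shows that this particular triple is not a counterexample; it is not a proof that the Jaccard distance is a metric.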
Bray-Curtis dissimilarity
The Bray-Curtis dissimilarity $d_B$ for binary vectors $x, y \in \{0,1\}^N$ is informally (e.g. on Wikipedia) defined as

$$
d_B(x, y) := 1 - \frac{2 C_{xy}}{S_x + S_y}
$$

where $C_{xy}$ is the number of species common to $x$ and $y$, and $S_x$ and $S_y$ are the total numbers of species in $x$ and $y$, respectively.
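As a quick illustration of the informal definition, the following Python sketch computes the dissimilarity directly from the three counts. The function name and the two presence/absence vectors are made up for this example.

```python
from fractions import Fraction


def bray_curtis_from_counts(x, y):
    """Binary Bray-Curtis dissimilarity via species counts: 1 - 2*C / (S_x + S_y)."""
    s_x = sum(x)  # number of species present in x
    s_y = sum(y)  # number of species present in y
    c_xy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # shared species
    if s_x + s_y == 0:
        raise ValueError("undefined when both samples are empty")
    return 1 - Fraction(2 * c_xy, s_x + s_y)


# Hypothetical samples over five species: three species each, two shared.
sample_a = (1, 1, 1, 0, 0)
sample_b = (0, 1, 1, 1, 0)

print(bray_curtis_from_counts(sample_a, sample_b))  # 1/3
```

Using `Fraction` keeps the result exact, which makes it easier to compare against the hand calculation 1 - 4/6 = 1/3.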
A more explicit definition is
$$
d_B(x, y) := 1 - \frac{2 \sum_i \min(x_i, y_i)}{\sum_i (x_i + y_i)} \overset{(1)}{=} \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)}
$$
The identity (1) can be shown using the properties of the absolute value function:
$$
|x - y| = \max(x, y) - \min(x, y)
$$
Then
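$$
\begin{aligned}
1 - \frac{2 \sum_i \min(x_i, y_i)}{\sum_i (x_i + y_i)}
&= \frac{\sum_i \left( x_i + y_i - 2 \min(x_i, y_i) \right)}{\sum_i (x_i + y_i)} \\
&= \frac{\sum_i \left( x_i + y_i - 2 \min(x_i, y_i) + \max(x_i, y_i) - \max(x_i, y_i) \right)}{\sum_i (x_i + y_i)} \\
&= \frac{\sum_i \left( x_i + y_i - \max(x_i, y_i) - \min(x_i, y_i) + |x_i - y_i| \right)}{\sum_i (x_i + y_i)} \\
&= \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)} \qquad \blacksquare
\end{aligned}
$$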
The last step follows from the fact that
$$
\max(x_i, y_i) + \min(x_i, y_i) = x_i + y_i
$$
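Identity (1) can also be sanity-checked numerically. The sketch below compares both forms of the Bray-Curtis dissimilarity over all pairs of binary vectors of a small, arbitrarily chosen length, skipping the one pair where the denominator is zero. The function names are my own, and a brute-force check like this complements the derivation rather than replacing it.

```python
from fractions import Fraction
from itertools import product


def bc_min_form(x, y):
    """Left-hand side of (1): 1 - 2*sum(min) / sum(x_i + y_i)."""
    total = sum(a + b for a, b in zip(x, y))
    return 1 - Fraction(2 * sum(min(a, b) for a, b in zip(x, y)), total)


def bc_abs_form(x, y):
    """Right-hand side of (1): sum|x_i - y_i| / sum(x_i + y_i)."""
    total = sum(a + b for a, b in zip(x, y))
    return Fraction(sum(abs(a - b) for a, b in zip(x, y)), total)


N = 4  # vector length, chosen small so the exhaustive check stays cheap
vectors = list(product((0, 1), repeat=N))

# Skip the pair of all-zero vectors, where the denominator vanishes.
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]

assert all(bc_min_form(x, y) == bc_abs_form(x, y) for x, y in pairs)
print(len(pairs))  # 255
```

Exact rational arithmetic avoids any floating-point tolerance in the comparison.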
Jaccard distance
The Jaccard distance is defined as
$$
d_J(x, y) := 1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}
$$
According to the `vegdist` documentation it can be expressed in terms of the Bray-Curtis dissimilarity as

$$
d_J(x, y) = \frac{2 \, d_B(x, y)}{1 + d_B(x, y)}
$$
Substituting the definition of dB and simplifying we get
$$
d_J(x, y) = \frac{2 \sum_i |x_i - y_i|}{\sum_i (x_i + y_i) + \sum_i |x_i - y_i|}
$$
To prove this identity we start by simplifying the numerator as follows
$$
2 \sum_i |x_i - y_i| = 2 \sum_i \max(x_i, y_i) - 2 \sum_i \min(x_i, y_i)
$$
We simplify the denominator as follows
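$$
\begin{aligned}
\sum_i (x_i + y_i) + \sum_i |x_i - y_i|
&= \sum_i \left( x_i + y_i + \max(x_i, y_i) - \min(x_i, y_i) \right) \\
&= \sum_i \left( x_i + y_i + \max(x_i, y_i) - \min(x_i, y_i) + \max(x_i, y_i) - \max(x_i, y_i) \right) \\
&= 2 \sum_i \max(x_i, y_i)
\end{aligned}
$$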
Putting everything together we get
$$
\frac{2 \sum_i \max(x_i, y_i) - 2 \sum_i \min(x_i, y_i)}{2 \sum_i \max(x_i, y_i)} = 1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} \qquad \blacksquare
$$
thus proving the identity.
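The relation between the two dissimilarities can be checked the same way. This Python sketch, with function names of my own choosing, computes both quantities independently from their definitions and confirms the conversion on every pair of binary vectors of a small, arbitrarily chosen length.

```python
from fractions import Fraction
from itertools import product


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    total = sum(a + b for a, b in zip(x, y))
    return Fraction(sum(abs(a - b) for a, b in zip(x, y)), total)


def jaccard(x, y):
    """Jaccard distance: 1 - sum(min) / sum(max)."""
    union = sum(max(a, b) for a, b in zip(x, y))
    return 1 - Fraction(sum(min(a, b) for a, b in zip(x, y)), union)


def jaccard_from_bray_curtis(d_b):
    """Convert a Bray-Curtis dissimilarity to a Jaccard distance: 2*d_B / (1 + d_B)."""
    return 2 * d_b / (1 + d_b)


N = 4  # small vector length for an exhaustive check
vectors = list(product((0, 1), repeat=N))
# Both quantities are undefined for a pair of all-zero vectors, so skip it.
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]

assert all(jaccard(x, y) == jaccard_from_bray_curtis(bray_curtis(x, y)) for x, y in pairs)

# One pair worked by hand: three species each, two shared.
x, y = (1, 1, 1, 0), (0, 1, 1, 1)
print(bray_curtis(x, y), jaccard(x, y))  # 1/3 1/2
```

For the hand-worked pair, $d_B = 1/3$ gives $2 \cdot \tfrac{1}{3} / \tfrac{4}{3} = \tfrac{1}{2}$, which matches the Jaccard distance computed from its definition.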
Tanimoto distance
The Tanimoto distance, also called “generalized Jaccard distance”, is another common distance found in theoretical ecology.
It is defined using dot products as
$$
d_T(x, y) := 1 - \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y}
$$
The Tanimoto distance is equal to the Jaccard distance for binary vectors:
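$$
\begin{aligned}
d_J(x, y) &= d_T(x, y) \\
1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} &= 1 - \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y}
\end{aligned}
$$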
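This identity follows from the fact that for binary vectors $x_i y_i = \min(x_i, y_i)$ and $x_i^2 = x_i$, so the dot products simplify to

$$
\begin{aligned}
x \cdot y &= \sum_i \min(x_i, y_i) \\
x \cdot x &= \sum_i x_i
\end{aligned}
$$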
Simplifying dT we get
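$$
\begin{aligned}
d_T(x, y)
&= 1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i x_i + \sum_i y_i - \sum_i \min(x_i, y_i)} \\
&= 1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i \left( x_i + y_i - \min(x_i, y_i) + \max(x_i, y_i) - \max(x_i, y_i) \right)} \\
&= 1 - \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} \\
&= d_J(x, y) \qquad \blacksquare
\end{aligned}
$$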
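The equality relies on the vectors being binary, which the following Python sketch tries to illustrate. It checks the identity on all binary pairs of a small, arbitrarily chosen length and then evaluates both formulas on one made-up abundance pair, where the min/max formula and the dot-product formula no longer agree. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def tanimoto(x, y):
    """Tanimoto distance: 1 - x.y / (x.x + y.y - x.y)."""
    xy = dot(x, y)
    return 1 - Fraction(xy, dot(x, x) + dot(y, y) - xy)


def jaccard(x, y):
    """Jaccard distance in min/max form: 1 - sum(min) / sum(max)."""
    union = sum(max(a, b) for a, b in zip(x, y))
    return 1 - Fraction(sum(min(a, b) for a, b in zip(x, y)), union)


N = 4  # small vector length for an exhaustive check
vectors = list(product((0, 1), repeat=N))
# Skip the pair of all-zero vectors, where both denominators vanish.
pairs = [(x, y) for x in vectors for y in vectors if sum(x) + sum(y) > 0]

# On binary vectors the two distances coincide.
assert all(tanimoto(x, y) == jaccard(x, y) for x, y in pairs)

# Hypothetical abundance (non-binary) pair: the two formulas differ.
u, v = (2, 0), (1, 1)
print(tanimoto(u, v), jaccard(u, v))  # 1/2 2/3
```

For the abundance pair, $u \cdot v = 2$, $u \cdot u = 4$ and $v \cdot v = 2$ give $d_T = 1/2$, while $\sum \min = 1$ and $\sum \max = 3$ give $2/3$ for the min/max formula.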
References