## Tanimoto similarity and Tanimoto distance

By 在线疯狂, published June 7, 2014, in 译林.

Various forms of functions described as Tanimoto similarity and Tanimoto distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard similarity and Jaccard distance, but some are mathematically different. Many sources[3] cite an unavailable IBM Technical Report[4] as the seminal reference.

In "A Computer Program for Classifying Plants", published in October 1960,[5] a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.

### Definitions of Tanimoto similarity and Tanimoto distance

In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set in either sample.

Presented in mathematical terms, if samples X and Y are bitmaps, $X_i$ is the $i$th bit of X, and $\land$, $\lor$ are the bitwise and, or operators respectively, then the similarity ratio $T_s$ is

$T_s(X,Y) = \frac{\sum_i ( X_i \land Y_i)}{\sum_i ( X_i \lor Y_i)}$

If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard Coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.
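The equivalence between the bitmap form and the set form can be checked with a minimal Python sketch. It assumes each bitmap is stored as a non-negative integer; the function names `tanimoto_similarity`, `jaccard`, and `bits_to_set` are illustrative and do not come from the paper.

```python
def tanimoto_similarity(x: int, y: int) -> float:
    """Similarity ratio T_s for two bitmaps stored as non-negative ints."""
    either = bin(x | y).count("1")   # bits set in either sample
    if either == 0:
        raise ValueError("T_s is undefined when neither sample has a bit set")
    common = bin(x & y).count("1")   # bits set in both samples
    return common / either


def bits_to_set(x: int) -> set:
    """Positions of the set bits, i.e. the attributes present in the sample."""
    return {i for i in range(x.bit_length()) if (x >> i) & 1}


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two sets (at least one must be non-empty)."""
    return len(a & b) / len(a | b)


x = 0b101101  # attributes 0, 2, 3, 5
y = 0b100111  # attributes 0, 1, 2, 5

# Three attributes are shared and five occur in either sample.
print(tanimoto_similarity(x, y))
print(jaccard(bits_to_set(x), bits_to_set(y)))
```

Both calls compute the same ratio, 3 common attributes out of 5 present in either sample.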

Tanimoto goes on to define a distance coefficient based on this ratio, defined over values with non-zero similarity:

$T_d(X,Y) = -\log_2 ( T_s(X,Y) )$

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example that violates the triangle inequality.
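One such example, sketched in Python under the assumption that samples are modelled as sets of attributes: the specimens `x` and `y` below are hypothetical and share only one attribute, while each is close to a third specimen `z`.

```python
import math


def t_s(a: frozenset, b: frozenset) -> float:
    """Similarity ratio: common attributes over attributes in either sample."""
    return len(a & b) / len(a | b)


def t_d(a: frozenset, b: frozenset) -> float:
    """Tanimoto's distance coefficient, defined only for non-zero similarity."""
    s = t_s(a, b)
    if s == 0:
        raise ValueError("T_d is only defined for non-zero similarity")
    return -math.log2(s)


x = frozenset({1, 2})
y = frozenset({2, 3})
z = frozenset({1, 2, 3})

direct = t_d(x, y)              # -log2(1/3), about 1.585
via_z = t_d(x, z) + t_d(z, y)   # 2 * -log2(2/3), about 1.170

# The triangle inequality would require direct <= via_z.
print(direct, via_z, direct <= via_z)
```

Here the direct coefficient between `x` and `y` exceeds the sum of the two coefficients through `z`, so the triangle inequality fails.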

### Other definitions of Tanimoto distance

Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance ($1 - T_s$). Jaccard distance is a proper distance metric. "Tanimoto distance" is often stated to be a proper distance metric as well, probably because of its confusion with Jaccard distance.

If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as

$f(A,B) =\frac{ A \cdot B}{\vert A\vert^2 +\vert B\vert^2 - A \cdot B }$

where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1),

$A \cdot B = \sum_i A_iB_i = \sum_i ( A_i \land B_i)$ and ${\vert A\vert}^2 = \sum_i A_i^2 = \sum_i A_i$.
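As a sanity check, the agreement of $f$ and $T_s$ on bit vectors can be verified exhaustively for a small dimension. The sketch below assumes 4-dimensional vectors and excludes the all-zero vector, for which both ratios are undefined; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def f(a, b):
    """Vector form: A.B / (|A|^2 + |B|^2 - A.B), computed exactly."""
    ab = dot(a, b)
    return Fraction(ab, dot(a, a) + dot(b, b) - ab)


def t_s(a, b):
    """Bitwise form: count of (a_i AND b_i) over count of (a_i OR b_i)."""
    common = sum(p & q for p, q in zip(a, b))
    either = sum(p | q for p, q in zip(a, b))
    return Fraction(common, either)


# All non-zero bit vectors of dimension 4.
vectors = [v for v in product((0, 1), repeat=4) if any(v)]
agree = all(f(a, b) == t_s(a, b) for a in vectors for b in vectors)
print(agree)
```

Exact rational arithmetic via `Fraction` avoids any floating-point rounding in the comparison.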

This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of $T_s$ do not necessarily extend to $f$. In particular, the difference function $( 1 - f)$ does not preserve triangle inequality, and is not therefore a proper distance metric, whereas $( 1 - T_s)$ is.
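A small counterexample makes the failure concrete. The sketch below uses three hypothetical one-dimensional vectors with values 1, 2 and 4, which are not bit vectors, and evaluates $(1 - f)$ exactly.

```python
from fractions import Fraction


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def f(a, b):
    """A.B / (|A|^2 + |B|^2 - A.B), computed exactly."""
    ab = dot(a, b)
    return Fraction(ab, dot(a, a) + dot(b, b) - ab)


def d(a, b):
    """The difference function 1 - f."""
    return 1 - f(a, b)


a, b, c = (1,), (2,), (4,)

direct = d(a, c)            # 1 - 4/13 = 9/13
via_b = d(a, b) + d(b, c)   # 1/3 + 1/3 = 2/3

# The triangle inequality would require direct <= via_b.
print(direct, via_b, direct <= via_b)
```

Since $9/13 > 2/3$, going through `b` is shorter than going directly from `a` to `c`, which a distance metric does not allow.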

There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function $(1 - f)$ is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.

Lipkus[1] uses a definition of Tanimoto similarity which is equivalent to $f$, and refers to Tanimoto distance as the function $(1 - f)$. It is however made clear within the paper that the context is restricted by the use of a (positive) weighting vector $W$ such that, for any vector A being considered, $A_i \in \{0,W_i\}$. Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.
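The restricted case can be explored by brute force. The sketch below picks a hypothetical weighting vector `W = (1, 2, 5)`, enumerates every non-zero vector with $A_i \in \{0, W_i\}$, and searches for triples that violate the triangle inequality under $(1 - f)$. It is only a numerical check for one choice of `W`, not a substitute for the proof in the paper.

```python
from fractions import Fraction
from itertools import product

W = (1, 2, 5)  # hypothetical positive weights, chosen for illustration


def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def d(a, b):
    """1 - f, where f = A.B / (|A|^2 + |B|^2 - A.B), computed exactly."""
    ab = dot(a, b)
    return 1 - Fraction(ab, dot(a, a) + dot(b, b) - ab)


# Every vector whose i-th component is either 0 or W[i], except the zero vector
# (f is undefined when both arguments are zero).
vectors = [v for v in product(*[(0, w) for w in W]) if any(v)]

violations = [
    (a, b, c)
    for a in vectors
    for b in vectors
    for c in vectors
    if d(a, c) > d(a, b) + d(b, c)
]
print(len(vectors), len(violations))
```

No violating triple is found for this `W`, consistent with the statement that such a set of vectors forms a metric space under this function.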

