谷本相似度和谷本距离

Various forms of functions described as Tanimoto similarity and Tanimoto distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard similarity and Jaccard distance, but some are mathematically different. Many sources[3] cite an unavailable IBM Technical Report[4] as the seminal reference.

在文献中和互联网上有各种形式的称为谷本相似度和谷本距离的函数。它们中的大部分是杰卡德相似度和杰卡德距离的同义词,但是其中一些在数学上有着一定的不同。

In "A Computer Program for Classifying Plants", published in October 1960,[5] a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.

在1960年10月出版的《植物分类计算机程序》中,提出了一个基于相似度比值的分类方法,并给出了一个导出的距离函数。这似乎是术语“谷本相似度”和“谷本距离”最权威的来源。其相似度比值等价于杰卡德相似度,但是距离函数与杰卡德距离不同。

Tanimoto's definitions of similarity and distance

谷本相似度和谷本距离的定义

In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set in either sample.

在那篇论文中,“相似度比值”以位图形式给出,其中定长数组的每一个位代表模型化植物存在与否。比值的定义是公共位的个数,除以样本中所有位的个数。

Presented in mathematical terms, if samples X and Y are bitmaps, X_i is the ith bit of X, and  \land , \lor  are bitwise andor operators respectively, then the similarity ratio T_s is

以数学形式展现,如果样本X和Y是位图,Xi是X的第i位,  \land , \lor 分别代表按位与和按位或操作,那么相似度比值Ts是

 T_s(X,Y) =  \frac{\sum_i ( X_i \land Y_i)}{\sum_i ( X_i \lor Y_i)}

If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard Coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.

如果每一个样本以属性集合的方式建模,这个值等同于两个集合的杰卡德系数。杰卡德在论文中没有提到,并且似乎作者并没有意识到这一点。

Tanimoto goes on to define a distance coefficient based on this ratio, defined over values with non-zero similarity:

谷本在比值的基础上定义了距离系数,基于非零相似度的比值:

T_d(X,Y) = -\log_2 ( T_s(X,Y) )

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

这个系数本意不是距离度量。它被选择允许两个彼此很不相同的样本,均与第三个样本相似。很容易构造一个不服从三角形不等式的例子。

Other definitions of Tanimoto distance

谷本距离的其他定义

Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance ( 1 - T_s). This function is a proper distance metric. "Tanimoto Distance" is often stated as being a proper distance metric, probably because of its confusion with Jaccard distance.

谷本距离通常是指(错误的)杰卡德距离(1-Ts)的同义词。这个函数式一个适当的距离度量。“谷本距离”通常描述为一个适当的距离度量,可能是因为它与杰卡德距离相混淆。

If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as

如果杰卡德相似度或者谷本相似度以位向量的形式表述,那么它可以写作下面的公式:


f(A,B) =\frac{ A \cdot B}{\vert A\vert^2 +\vert B\vert^2 -  A \cdot B }

where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then 

其中相同的计算以数量积和向量大小的形式表示。这种表示依赖于这样的事实,对于一个位向量(其中每一维度的值是0或者1),那么

A \cdot B = \sum_i A_iB_i = \sum_i ( A_i \land B_i) and {\vert A\vert}^2 = \sum_i A_i^2 = \sum_i A_i .

This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of  T_s  do not necessarily extend to f. In particular, the difference function ( 1 - f) does not preserve triangle inequality, and is not therefore a proper distance metric, whereas ( 1 - T_s)  is.

There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function (1 - f) is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.

这是一个潜在的可能引起混淆的表示,因为上面以向量表示的函数更加通用,除非显式地定义了其领域。Ts的属性不一定扩展(extend to)到f。特别的,差异函数(1-f)不服从三角形不等式,并且因此不是一个适当的距离度量,而(1-f)是。

使用此公式定义的“谷本距离”,与声明“谷本距离是合适的距离度量”会得到错误的结论,函数(1-f)实际上是向量或者多重集合的通用距离度量,然而它在相似度查询或者聚类算法中的使用可能会产生错误的结果。

Lipkus[1] uses a definition of Tanimoto similarity which is equivalent to f, and refers to Tanimoto distance as the function  (1 - f) . It is however made clear within the paper that the context is restricted by the use of a (positive) weighting vector W such that, for any vector A being considered,  A_i \in \{0,W_i\} . Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.

Lipkus使用了等价于f的谷本相似度的定义,并且将谷本距离定义为函数(1-f)。然而在论文中明确的是上下文限定于使用(整的)带权向量W,对于任何向量A, A_i \in \{0,W_i\} 。基于这些限制条件,函数式一个适当的距离度量,因而通过这样一个带权向量约束的向量集在此函数下组成了度量空间。

本文链接:http://bookshadow.com/weblog/2014/06/07/tanimoto-similarity-and-distance/
请尊重作者的劳动成果,转载请注明出处!书影博客保留对文章的所有权利。

如果您喜欢这篇博文,欢迎您捐赠书影博客: ,查看支付宝二维码