Java滚动数组计算编辑距离

编辑距离(Edit Distance),也称Levenshtein距离,是指由一个字符串转换为另一个字符串所需的最少编辑次数。

下面的代码摘自org.apache.commons.lang.StringUtils

用法示例:

StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
StringUtils.getLevenshteinDistance("","")               = 0
StringUtils.getLevenshteinDistance("","a")              = 1
StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
StringUtils.getLevenshteinDistance("frog", "fog")       = 1
StringUtils.getLevenshteinDistance("fly", "ant")        = 3
StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
StringUtils.getLevenshteinDistance("hello", "hallo")    = 1

Java代码:

public static int getLevenshteinDistance(String s, String t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        String tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n+1]; //'previous' cost array, horizontally
    int d[] = new int[n+1]; // cost array, horizontally
    int _d[]; //placeholder to assist in swapping p and d

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    for (i = 0; i<=n; i++) {
        p[i] = i;
    }

    for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        for (i=1; i<=n; i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now 
    // actually has the most recent cost counts
    return p[n];
}

实际上,上述代码的空间复杂度还可以进一步简化,使用一维数组替换滚动数组。

Java代码:

public int minDistance(String s, String t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        String tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int d[] = new int[n+1]; // cost array, horizontally

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    for (i = 0; i<=n; i++) {
        d[i] = i;
    }
    for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        int pre = d[0];
        d[0] = j;
        for (i=1; i<=n; i++) {
            int temp = d[i];
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, d[i]+1),  pre+cost);
            pre = temp;
        }    
    }

    return d[n];
}

 

本文链接:http://bookshadow.com/weblog/2016/06/14/java-levenshtein-distance-with-two-arrays/
请尊重作者的劳动成果,转载请注明出处!书影博客保留对文章的所有权利。

如果您喜欢这篇博文,欢迎您捐赠书影博客: ,查看支付宝二维码