My wife was recently doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like “KIRK ANDREWS FORCEFUL” or “AULDREEKIE ADDISON VACUM”.
It also happens that when people are searching for a bull, they may type in the wrong name, for example typing “AULDREKIE ADDISON VACUUM” instead of “AULDREEKIE ADDISON VACUM”, or “JISSELVLIEDT 135 BREAKOUT” instead of “IJSSELVLIEDT 135 BREAKOUT”.
So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of insertions, deletions or substitutions of a single character required to transform one string into the other. For example, the following transformations all have a Levenshtein distance of 1:
BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)
While the following have a Levenshtein distance of 2:
PAPER -> TAPE (1 deletion, 1 substitution)
TAPE -> TRADE (1 insertion, 1 substitution)
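The distances listed above can be checked against the direct recursive definition. This is only a minimal sketch to pin down the definition; it is exponential-time and not something you would use on long strings (the function name lev is my own):

```python
def lev(a, b):
    # Transforming to or from the empty string takes one
    # insertion/deletion per remaining character.
    if not a:
        return len(b)
    if not b:
        return len(a)
    # Matching first characters cost nothing.
    if a[0] == b[0]:
        return lev(a[1:], b[1:])
    # Otherwise pay 1 for the cheapest of the three edits.
    return 1 + min(lev(a[1:], b),      # deletion from a
                   lev(a, b[1:]),      # insertion into a
                   lev(a[1:], b[1:]))  # substitution
```

For example, lev('PAPER', 'TAPE') returns 2, matching the deletion-plus-substitution path above.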
Wikipedia has a very good description of the algorithm.
They also have pseudocode, which I have implemented in Python. Here it is:
'''
Write a function that calculates the Levenshtein distance between two words.
'''

def levenshtein(first, second):
    m = len(first) + 1
    n = len(second) + 1

    # Declare an (m x n) array...
    d = [[0] * n for j in range(m)]

    # Initialise the first column and first row of the array...
    for i in range(m):
        d[i][0] = i
    for j in range(n):
        d[0][j] = j

    # Fill in the rest of the array...
    for i in range(1, m):
        for j in range(1, n):
            if first[i - 1] == second[j - 1]:
                # No operation required...
                d[i][j] = d[i - 1][j - 1]
            else:
                # Cheapest of a deletion, an insertion or a substitution...
                d[i][j] = min(d[i - 1][j],
                              d[i][j - 1],
                              d[i - 1][j - 1]) + 1

    return d[m - 1][n - 1]

def main():
    list1 = 'great grate'.split()
    list2 = 'grate rate ate'.split()
    for word1 in list1:
        for word2 in list2:
            print('Levenshtein distance between %s and %s is %d' % (
                word1, word2, levenshtein(word1, word2)))

if __name__ == '__main__':
    main()
Once again, the Python code looks very similar to the pseudocode.
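To close the loop on the original problem, here is a sketch of how the function could be used to guess what the user probably meant: pick the stored name with the smallest edit distance to the query. The bull list and the closest_name helper are illustrative assumptions, and the levenshtein function is restated so the snippet runs on its own:

```python
def levenshtein(first, second):
    # Same dynamic-programming scheme as the implementation above.
    m, n = len(first) + 1, len(second) + 1
    d = [[0] * n for _ in range(m)]
    for i in range(m):
        d[i][0] = i
    for j in range(n):
        d[0][j] = j
    for i in range(1, m):
        for j in range(1, n):
            if first[i - 1] == second[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j],
                              d[i][j - 1],
                              d[i - 1][j - 1]) + 1
    return d[m - 1][n - 1]

def closest_name(query, names):
    # Return the stored name with the smallest edit distance to the query.
    return min(names, key=lambda name: levenshtein(query, name))

bulls = ['KIRK ANDREWS FORCEFUL',
         'AULDREEKIE ADDISON VACUM',
         'IJSSELVLIEDT 135 BREAKOUT']
print(closest_name('AULDREKIE ADDISON VACUUM', bulls))
print(closest_name('JISSELVLIEDT 135 BREAKOUT', bulls))
```

Note that min with a key simply picks the first name achieving the smallest distance; a real lookup would probably also want a cutoff so that wildly different queries return no match at all.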