My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like “KIRK ANDREWS FORCEFUL” or “AULDREEKIE ADDISON VACUM”.

It also happens that when people are searching for a bull, they may type in the wrong name, for example “AULDREKIE ADDISON VACUUM” or type “JISSELVLIEDT 135 BREAKOUT” instead of “IJSSELVLIEDT 135 BREAKOUT”.

So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:

``````BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)
``````

While the following have a Levenshtein distance of 2:

``````PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)
``````

Wikipedia has a very good description of the algorithm.

They also have the pseudocode which I have implemented in Python. Here it is:

``````'''
Write a function that calculates the Levenshtein distance between
two words.
'''

def levenshtein(first, second):
m = len(first) + 1
n = len(second) + 1

# Declare an array...
d = [ * n for j in range(m)]

# Initialise the array...
for i in range(m):
d[i] = i

for j in range(n):
d[j] = j

for i in xrange(1, m):
for j in range(1, n):
if first[i-1] == second[j-1]: # no operation required...
d[i][j] = d[i - 1][j - 1]
else:
d[i][j] = min( d[i - 1][j], d[i][j - 1],
d[i - 1][j - 1]) + 1
return d[m-1][n-1]

def main():
list1 = 'great grate'.split()
list2 = 'grate rate ate'.split()

for word1 in list1:
for word2 in list2:
print 'Levenshtein distance between %s and %s is %d' % (
word1, word2, levenshtein(word1, word2))

if __name__ == '__main__':
main()
``````

Once again, the Python code looks very similar to the pseudocode.