The algorithm is simple, it creates a set of consecutive substrings of fixed length on 2 two documents and calculates the Jaccard distance between these two sets. The Jaccard distance is a number between 1 and 0, inclusive, calculated as the size of the intersection of two sets divided by the size of their union. i.e |A ∩ B| / |A U B|. If this distance is 1 then the sets are equal.

Here's a simple implementation of the technique in python:

shingles = lambda s: set(s[i:i+3] for i in range(len(s)-2))

jaccard_distance = lambda seta, setb: len(seta & setb)/float(len(seta | setb))

a = raw_input("Original text: ")

b = raw_input("Other text: ")

sa = shingles(a)

sb = shingles(b)

print "These texts are %.2f%% similar" % (jaccard_distance(sa, sb)*100)

Another article I came across is MinHash which is used for the same purpose.

## No comments:

## Post a Comment