Sunday, March 6, 2011

Some Near duplicate detection

I've been reading a lot about Data Mining recently and today I came across Shingling, a method for near duplicate detection.

The algorithm is simple, it creates a set of consecutive substrings of fixed length on 2 two documents and calculates the Jaccard distance between these two sets. The Jaccard distance is a number between 1 and 0, inclusive, calculated as the size of the intersection of two sets divided by the size of their union. i.e |A ∩ B| / |A U B|. If this distance is 1 then the sets are equal.

Here's a simple implementation of the technique in python:

shingles = lambda s: set(s[i:i+3] for i in range(len(s)-2))
jaccard_distance = lambda seta, setb: len(seta & setb)/float(len(seta | setb))

a = raw_input("Original text: ")
b = raw_input("Other text: ")
sa = shingles(a)
sb = shingles(b)
print "These texts are %.2f%% similar" % (jaccard_distance(sa, sb)*100)

Another article I came across is MinHash which is used for the same purpose.

Saturday, November 20, 2010

A Python script using Tornado web server with the Twitter Streaming API

So today I decided to play with Tornado the open source version of a scalable, non-blocking web server which powers FriendFeed. While doing so I discovered it will probably be cool to play with the Twitter streaming API as well and write a Python script that uses both.

Here's the result of my experiment:

from tornado.httpclient import HTTPClient, HTTPRequest
import simplejson

def handle_tweet_stream(response):
    try:
        # Print the parsed twitter stream.
        print simplejson.loads(response)
    except ValueError:
        return
    print

if __name__ == '__main__':
    req = HTTPRequest(
        "http://stream.twitter.com/1/statuses/filter.json", 
        body='track=google',
        method = "POST",
        auth_username = "username",
        auth_password = "xxxxxxx",
        streaming_callback=handle_tweet_stream)

    client = HTTPClient()
    client.fetch(req)

Thursday, October 21, 2010

Lessons learned from working in large scale

Before my internship with Google, I wasn't a big fan of unit tests, logging and cloud computing. Now I can say there's a degree of doubt in me when I see some new code without tests and that this code claims to do the right thing.

When you're running code in parallel on 1000's of machines in the cloud, machines whose physical location you don't know and machines you don't have shell access to, your idea of project development changes dramatically. Things like logging and unit testing become your friends.

A good blog on this topic is found on Matt Welsh's blog

Thursday, March 25, 2010

A one line Python Quine

OK, Here's a one line Python program that prints its own source code:

s='s=%r;print s%%s';print s%s

This wikipedia article on Quines is a very good place to learn how these things work

Saturday, February 27, 2010

One line quicksort in Python

Here's a one line quicksort in python. I came up with this idea after looking at a 2 lines implementation in Haskel

qsort = lambda seq: [] if not seq else qsort(filter(lambda n: n<=seq[0], seq[1:]))+[seq[0]]+qsort(filter(lambda n: n>seq[0], seq[1:]))

Monday, December 21, 2009

5 Handy python functions (lambdas)

Here's a list of 5 handy python lambda's you may need when solving a quick maths problem

1) The sum of all integers from 1 to n
triangle = lambda n: n*(n+1)/2

2) Factorial
factorial = lambda n: 1 if n<2 else n*factorial(n-1)
3) Get odd numbers
odds = lambda n: [k for k in range(1,n,2)]
4) Test whether a number is prime

is_prime = lambda n: False if n<2 else (0 not in [n%k for k in range(2,int(n**.5)+1)])
A number is prime when it has no factor other than 1 and itself. To determine whether the number is a prime, we need to check that it doesn't leave a reminder of 0 when divided by each number in the range of [2..square root of the number].

5) Primes sieve
This function returns the list of primes in the range [2..N]. It's the most cryptic of all but it makes use of the previous function to add a number to the generated list.
sieve = lambda n: [k for k in range(2,n+1) if (0 not in [k%i for i in range(2,int(k**.5)+1)])]
It could also be written as:

is_prime = lambda n: False if n<2 else (0 not in [n%k for k in range(2,int(n**.5)+1)])
sieve = lambda n: [k for k in range(2,n+1) if is_prime(k)]