by shrikant 5663 days ago
IIRC, sriramk from around here (http://news.ycombinator.com/user?id=sriramk) had also 'rolled his own' web-crawler as a project in college about 5-6 (?) years back. He blogged about it fairly actively back then, and I really enjoyed following his journey (esp. when after months of dev and testing, he finally 'slipped it into the wild'). Tried to dredge up those posts, but he seems to have taken them down :( A shame really - they were quite a fascinating look at the early-stage evolution of a programmer!

Sriram, you around? ;)


Thanks but mine was definitely a toy. I think I got it to around 100K pages or so but that's about it (seemed like a big deal back then).

You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then.

Basically, I did the following:

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump - I think the latter)
- Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer)
- Use Lucene to construct TF.IDF indices on top of the repository
- Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.
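
For readers wondering what a Mercator-style design looks like in practice, here is a rough sketch of its central idea: a URL frontier that keeps one FIFO queue per host and only hands out a URL once that host's politeness delay has passed. This is not the original C# code. It is a simplified Python illustration, and the `Frontier` class, its method names, and the `delay` parameter are all made up for the example. The real Mercator frontier also has prioritised front queues, which are left out here.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Hypothetical, simplified Mercator-style URL frontier.

    One FIFO queue per host, plus a heap of (earliest fetch time, host)
    so that no host is contacted more often than once per `delay` seconds.
    """

    def __init__(self, delay=1.0):
        self.delay = delay    # assumed politeness gap per host, in seconds
        self.queues = {}      # host -> deque of pending URLs
        self.ready = []       # heap of (earliest_fetch_time, host)
        self.next_ok = {}     # host -> earliest time of its next fetch
        self.seen = set()     # every URL ever enqueued, for de-duplication

    def add(self, url, now=0.0):
        """Enqueue a URL; return False if it is a duplicate or has no host."""
        host = urlsplit(url).hostname
        if host is None or url in self.seen:
            return False
        self.seen.add(url)
        if host not in self.queues:
            # Host has no pending work, so (re)schedule it on the heap,
            # still honouring any delay left over from its last fetch.
            self.queues[host] = deque()
            when = max(now, self.next_ok.get(host, now))
            heapq.heappush(self.ready, (when, host))
        self.queues[host].append(url)
        return True

    def next(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_ok[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return url
```

A crawler worker would seed the frontier (for example from a directory dump), then loop: call `next(time.monotonic())`, fetch the page, extract links, and `add` them back. Passing the clock in as an argument keeps the sketch easy to test without sleeping.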

I did do some cool hacks around showing a better snippet than Google did at the time but I just didn't have the networking bandwidth to do anything serious. Fun for a college project.

The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)

Do you remember how fast the "toy" was? (pages/second, domains/s, ...) :)
Not really, but given the terrible hardware/network connectivity, the numbers wouldn't mean much now.

Because of this thread, I looked through my old backups and I actually still have the code. Should get it working again sometime.

Are you gonna put up your code?

It would be interesting to see how to think through building a crawler (as opposed to downloading Nutch and trying to grok it)