by sriramk 5654 days ago
Thanks but mine was definitely a toy. I think I got it to around 100K pages or so but that's about it (seemed like a big deal back then).

You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then.

Basically, I did the following:

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump; I think the latter).
- Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer).
- Use Lucene to construct TF.IDF indices on top of the repository.
- Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.
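For anyone wondering what the TF.IDF indexing step amounts to, here is a minimal Python sketch. It is not the original C# code and not how Lucene actually scores documents. The class name `TfIdfIndex`, the tokenizer, and the particular tf and idf formulas are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfIdfIndex:
    """Toy in-memory inverted index with TF-IDF ranking (hypothetical, not Lucene's scoring)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_lengths = {}              # doc_id -> number of tokens

    def add(self, doc_id, text):
        """Index one crawled page under the given id."""
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            self.postings[term][doc_id] = count

    def search(self, query, limit=10):
        """Return doc ids ranked by the summed tf * idf of the query terms."""
        scores = defaultdict(float)
        n_docs = len(self.doc_lengths)
        for term in tokenize(query):
            docs = self.postings.get(term)
            if not docs:
                continue  # term never seen, contributes nothing
            # Assumed idf variant: rarer terms get a larger weight.
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, count in docs.items():
                # Assumed tf variant: count normalized by document length.
                tf = count / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:limit]


# Usage with made-up page ids and text.
index = TfIdfIndex()
index.add("page-1", "web crawler design based on mercator")
index.add("page-2", "lucene builds an inverted index")
print(index.search("crawler"))
```

A real engine would also need stemming, stop words, and an on-disk index format, which is roughly what Lucene handled in the project.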

I did do some cool hacks around showing a better snippet than Google did at the time, but I just didn't have the network bandwidth to do anything serious. Fun for a college project.

The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)


Do you remember how fast the "toy" was? (pages/second, domains/s, ...) :)
Not really, but given the terrible hardware and network connectivity, the numbers wouldn't make much sense now.

Because of this thread, I looked through my old backups and I actually still have the code. Should get it working again sometime.

Are you gonna put up your code?

It would be interesting to see how to think through building a crawler (as opposed to downloading Nutch and trying to grok it).