| HN Mirror

Thanks but mine was definitely a toy. I think I got it to around 100K pages or so but that's about it (seemed like a big deal back then).

You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then

Basically, I did the following

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump - I think the latter) - Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer) - Use Lucene to construct TF.IDF indices on top of the repository - Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.

I did do some cool hacks around showing a better snippet than Google did at the time but I just didn't have the networking bandwidth to do anything serious. Fun for a college project.

The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)