by sriramk
5654 days ago
Thanks, but mine was definitely a toy. I think I got it to around 100K pages or so, but that's about it (seemed like a big deal back then). You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then.

Basically, I did the following:

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump - I think the latter)
- Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer)
- Use Lucene to construct TF.IDF indices on top of the repository
- Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.

I did do some cool hacks around showing a better snippet than Google did at the time, but I just didn't have the network bandwidth to do anything serious. Fun for a college project.

The funny thing is that a startup involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)
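
For anyone curious what the TF.IDF indexing step in the list above boils down to, here is a minimal sketch in Python. To be clear, this is not the original C#/Lucene code and not Lucene's actual scoring formula; the function names (`tokenize`, `build_index`, `search`), the whitespace tokenizer, and the `log(1 + N/df)` weighting are all assumptions picked for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index from {doc_id: text}.

    Returns (postings, doc_count), where postings maps
    term -> {doc_id: term frequency in that document}.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    return postings, len(docs)


def search(query, postings, doc_count):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        matches = postings.get(term, {})
        if not matches:
            continue
        # Assumed idf variant; rarer terms get a larger weight.
        idf = math.log(1 + doc_count / len(matches))
        for doc_id, tf in matches.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "a": "web crawler design",
    "b": "search engine design",
    "c": "crawler crawler frontier",
}
postings, doc_count = build_index(docs)
results = search("crawler", postings, doc_count)
```

Here `results` ranks document `c` ahead of `a` because "crawler" appears in it twice, and `b` is left out since it never mentions the term. A real engine would add length normalization, stemming, and an on-disk index, which is the part Lucene handled.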