by shrikant
5663 days ago
IIRC, sriramk from around here (http://news.ycombinator.com/user?id=sriramk) had also 'rolled his own' web-crawler as a project in college about 5-6 (?) years back. He blogged about it fairly actively back then, and I really enjoyed following his journey (esp. when after months of dev and testing, he finally 'slipped it into the wild'). Tried to dredge up those posts, but he seems to have taken them down :( A shame really - they were quite a fascinating look at the early-stage evolution of a programmer! Sriram, you around? ;)
You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then.
Basically, I did the following:
- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump - I think the latter)
- Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer)
- Use Lucene to construct TF.IDF indices on top of the repository
- Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.
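For anyone curious what the TF.IDF indexing step boils down to, here is a rough Python sketch. It is not the original C# code and not Lucene's actual scoring formula; the names `tokenize`, `build_index` and `search` are made up for illustration, and it assumes the crawled pages are already available as a dict of doc id to text.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does far more."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        # Rare terms get a higher weight than common ones.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Usage would be something like `index = build_index(docs)` followed by `search(index, len(docs), "crawler")`. Lucene layers length normalization and other factors on top of this basic idea, so treat the above as a toy.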
I did do some cool hacks around showing a better snippet than Google did at the time, but I just didn't have the networking bandwidth to do anything serious. Fun for a college project.
The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)