by jm4 5472 days ago
It is not harder or easier. It is just different from how it used to be. Things that used to be hard are easy now. Problems that didn't exist 10 years ago exist today. I develop a spider.

The bulk of the web's content is as easily accessible as it was years ago. You make a request and you get a blob of HTML back. If you have special requirements and need to get into all the nooks and crannies, you create a DOM implementation and embed a JavaScript engine. Then you parse the page into a DOM and start firing off events. There are quality open source JavaScript engines available. JavaScript and AJAX are a breeze.
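To make the simple case concrete, here is a minimal sketch of the "make a request, get a blob of HTML back" step, using only the Python standard library. The names `fetch`, `LinkExtractor`, and `extract_links` are made up for illustration; this is not how my spider is written, and it does nothing about JavaScript, robots.txt, or politeness delays.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_http = absolute.startswith(("http://", "https://"))
                if is_http and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated outgoing links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawl loop would call `extract_links(fetch(url), url)` and push the results onto a queue. Anything a script inserts into the page after load is invisible to this approach, which is where the DOM and the embedded JavaScript engine come in.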

Flash is a different story. If you have any requirement to follow links or process content in a Flash movie (you'd be surprised how many sites still have Flash nav) you pretty much have to write your own runtime. Unless you are big enough to have Adobe do it for you.

Depending on what you are doing with the data that your spider collects, chances are writing a spider is far easier than writing a browser. There are at least 4 widely used browser engines and plenty more toy browsers floating around.

I can guarantee that writing a spider that can deal with AJAX is not the biggest challenge of developing a search engine. Scaling it, fighting spam, understanding the content, indexing, and then being able to provide quick lookups are much, much harder.

1 comment

I recently looked through the Google developer guidelines and they still recommend not changing the page contents significantly using JavaScript. Also, the #! in modern AJAX apps is there to avoid having to run the JavaScript in order to crawl the content. Does Google actually do that much with the JavaScript on a page even today?
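For anyone unfamiliar with the hashbang (`#!`) convention: under Google's AJAX crawling scheme, the crawler rewrites a `#!` URL into one carrying an `_escaped_fragment_` query parameter, and the server is expected to answer that URL with a pre-rendered HTML snapshot. A rough sketch of the rewrite follows. The function name is made up, and the percent-encoding here is a simplification rather than the exact escaping rules in the spec.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Map a '#!' URL to its '_escaped_fragment_' form (illustrative only)."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        # Not a hashbang URL: nothing to rewrite.
        return url
    # Assumption: leave common URL punctuation alone, but encode
    # characters such as '&', '#', '%', '+' and spaces.
    state = quote(parts.fragment[1:], safe="=/:,;@!$'()*?")
    separator = "&" if parts.query else ""
    query = f"{parts.query}{separator}_escaped_fragment_={state}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

So `http://example.com/page#!key=value` would be requested as `http://example.com/page?_escaped_fragment_=key=value`, and the crawler never has to execute the page's scripts.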