| HN Mirror



	by cypherpunks 5476 days ago
	The Google spider grabs pages from other servers. That makes it a web client. It was easy to write when Google was started, but it is almost impossible to write today. If search engines hadn't been invented 20 years ago, they'd be impossible to invent today. They still work only because Google puts tremendous work into making its spider handle complex AJAXy pages, and because content creators engage in SEO and develop with Google in mind.

3 comments

jm4 5476 days ago

It is not harder or easier. It is just different than it used to be. Things that used to be hard are easy now. Problems that didn't exist 10 years ago exist today. I develop a spider.

For the most part, the bulk of the web's content is as easily accessible as it was years ago. You make a request and you get a blob of HTML back. If you have special requirements and need to get into all the nooks and crannies you create a DOM implementation and embed a JavaScript engine. Then you parse the page into a DOM and start firing off events. There are quality open source JavaScript engines available. JavaScript and AJAX are a breeze.
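
To make the "request, then a blob of HTML" step concrete, here is a minimal sketch of the link-extraction part using only Python's standard library. The names `LinkExtractor` and `extract_links` are made up for illustration, and this covers only static HTML; the DOM-plus-JavaScript-engine case described above is not attempted here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the fetched page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A spider would feed each fetched page through something like `extract_links(body, page_url)` and queue the results, which is roughly why static pages are the easy part.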

Flash is a different story. If you have any requirement to follow links or process content in a Flash movie (you'd be surprised how many sites still have Flash nav) you pretty much have to write your own runtime. Unless you are big enough to have Adobe do it for you.

Depending on what you are doing with the data that your spider collects, chances are writing a spider is far easier than writing a browser. There are at least 4 widely used browser engines and plenty more toy browsers floating around.

I can guarantee that writing a spider that can deal with AJAX is not the biggest challenge of developing a search engine. Scaling it, fighting spam, understanding the content, indexing, and then being able to provide quick lookups are much, much harder.


VBprogrammer 5476 days ago

I recently looked through the Google developer guidelines and they still recommend not changing the page contents significantly using JavaScript. Also, the #! in modern AJAX apps is there to avoid having to run the JavaScript in order to crawl the content. Does Google actually do that much with the JavaScript on a page even today?
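
For anyone unfamiliar with the #! convention, here is a rough sketch of the URL rewrite a crawler performs, as I understand Google's AJAX crawling proposal: the part after `#!` is moved into an `_escaped_fragment_` query parameter so the server can return a static snapshot. The function name `escaped_fragment_url` is hypothetical, and the escaping here (via `urllib.parse.quote`) is an approximation that escapes more characters than the proposal strictly requires.

```python
from urllib.parse import quote


def escaped_fragment_url(url):
    """Rewrite a #! URL into the form a crawler would request instead."""
    base, marker, fragment = url.partition("#!")
    if not marker:
        # Not a hashbang URL, so there is nothing to rewrite.
        return url
    # Append to an existing query string if there is one.
    joiner = "&" if "?" in base else "?"
    # Keep "=" and "/" readable; characters such as "&", "#", "%" and "+" get escaped.
    return base + joiner + "_escaped_fragment_=" + quote(fragment, safe="=/")
```

So a crawler that sees `http://example.com/app#!key=value` would fetch `http://example.com/app?_escaped_fragment_=key=value` and never run any script.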


methodin 5476 days ago

I don't really quite get what point you are making here. Are you saying that the innovation today owes more to the innovation of the mid-nineties? Are you saying that any innovation now with respect to AJAX and JS does not make your life better in any way?

Seemingly your argument could be made for cars: "15 years ago cars were easy to fix and understand. Now they are not, so wake me up when they are like the cars of the mid-nineties."

JS and AJAX help programmers tremendously. They help speed. They help functionality.

To view these things through the lens of "I can't write a crawler for them" is a pretty limited view of what today's technology offers.


stanleydrew 5476 days ago

Indeed I see now what you mean.


cypherpunks 5476 days ago

I'm glad. Thank you for taking the time to read and understand. Hacker News is starting to slide into the same decline that hit reddit 2 years ago, where people don't bother to try to understand different viewpoints, and just downvote anything they don't agree with. It's nice to see good people still on here...
