by cypherpunks
5476 days ago
The Google Spider grabs pages from other servers. That's a web client. It's a web client that was easy to write when Google was started, but is almost impossible to write today. If search engines hadn't been invented 20 years ago, they'd be impossible to invent today. The only reason they still work is tremendous work on Google's end to have its spider be able to spider complex AJAXy pages, and that content creators engage in SEO and develop to Google.

For the most part, the bulk of the web's content is as easily accessible as it was years ago. You make a request and you get a blob of HTML back. If you have special requirements and need to get into all the nooks and crannies, you create a DOM implementation and embed a JavaScript engine. Then you parse the page into a DOM and start firing off events. There are quality open source JavaScript engines available, so JavaScript and AJAX are a breeze.
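To make the "request, get a blob of HTML back" case concrete, here is a rough sketch of the link-extraction step in Python, using only the standard library. The names `LinkExtractor` and `extract_links` are made up for illustration, and this only covers plain HTML; it does not build a DOM or run any JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags in one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A real spider would get `html` from something like `urllib.request.urlopen(url).read()`, push the returned links onto a queue, and repeat. Politeness (robots.txt, rate limiting) and de-duplication are left out here.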
Flash is a different story. If you have any requirement to follow links or process content in a Flash movie (you'd be surprised how many sites still have Flash nav), you pretty much have to write your own runtime, unless you are big enough to have Adobe do it for you.
Depending on what you are doing with the data that your spider collects, chances are that writing a spider is far easier than writing a browser. There are at least four widely used browser engines and plenty more toy browsers floating around.
I can guarantee that writing a spider that can deal with AJAX is not the biggest challenge of developing a search engine. Scaling it, fighting spam, understanding the content, indexing it, and then being able to provide quick lookups are much, much harder.