| HN Mirror



	by tha-dude 5611 days ago
	I've been dabbling in content scraping. What bugs me is that, with all the AJAX trickery going on, merely analyzing the XHTML source doesn't get you very far in many cases. Executing the page (JS, DOM and all) by scripting a real browser is an option, but of course quite expensive. A headless browser is what's needed!

1 comments

pkandathil 5610 days ago

Yeah, I think that is the challenge. A good way to get around the AJAX problem is to check whether a site has an RSS feed and use that to extract content. I wish sites had a URL for bots built in so you didn't have to do all this fancy stuff to extract the content.
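
For illustration, here is a minimal sketch of the RSS approach using only Python's standard library. The feed text, the `SAMPLE_FEED` name, and the `extract_items` helper are made up for this example; a real feed would be downloaded first and may use Atom or namespaced elements that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example site</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Body of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>Body of the second post</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return one dict per <item> with its title, link and description."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

The content comes straight out of the feed's XML, so no JavaScript has to run.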


buss 5610 days ago

Many of the big sites will feed you non-AJAX content if you identify yourself as Googlebot.
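
A rough sketch of what that could look like with Python's standard library: the request carries Googlebot's published User-Agent string. The URL and the `build_request` helper are placeholders for this example, and whether a given site responds differently is an assumption; some sites verify real Googlebot traffic by reverse DNS and will ignore the header.

```python
import urllib.request

# User-Agent string that Googlebot has published for its crawler.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url):
    """Build (but do not send) a request that presents the Googlebot User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


if __name__ == "__main__":
    # Placeholder URL; swap in the page you actually want to fetch.
    req = build_request("http://example.com/")
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    print(html[:200])
```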
