by tha-dude 5564 days ago
I've been dabbling in content scraping. What bugs me is that, with all the AJAX trickery going on, merely analyzing the XHTML source doesn't get you very far in many cases. Executing the page (JS, DOM and all) via browser programming is an option, but of course it's quite expensive. A headless browser is what's needed!
1 comments

Yeah, I think that is the challenge. A good way to get around the AJAX problem is to check whether a site has an RSS feed and extract the content from that. I wish sites had a built-in URL for bots so you didn't have to do all this fancy stuff to extract the content.
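
For what it's worth, here is a rough sketch of the RSS route using only the Python standard library. The feed text, the example.com links, and the `extract_items` helper are all made up for illustration; a real feed would be downloaded first and may use namespaces or Atom instead of plain RSS 2.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return a list of dicts (title, link, description), one per <item>."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_RSS):
        print(entry["title"], entry["link"])
```

No JS or DOM execution is involved, which is why this is so much cheaper than driving a browser when a feed exists.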
Many of the big sites will feed you non-AJAX content if you present yourself as Googlebot.
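
A minimal sketch of what that might look like with Python's `urllib`, assuming the site keys off the User-Agent header alone. The `build_request` helper and the example.com URL are hypothetical. Some sites verify Googlebot by reverse DNS, so this may simply not work, and it may be against a site's terms.

```python
import urllib.request

# User-Agent string that Googlebot has publicly used; whether a given site
# serves different content for it is an assumption, not a guarantee.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build (but do not send) a request carrying the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://example.com/")  # placeholder URL
    # Sending is a separate step: urllib.request.urlopen(req).read()
    print(req.get_header("User-agent"))
```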