Hacker News new | ask | show | jobs
by pocket_cheese 1959 days ago
As someone who has done a good bit of scraping, how a website is designed dictates how I scrape.

If it's a static website that has consistently structured HTML and is easy to enumerate through all the webpages I'm looking for, then simple python requests code will work.

The less clear case is when to use a headless browser vs reverse engineering JS/server side APIs. Typically, I will do like a 10 minute dive into the client side js and monitor ajax requests to see if it would be super easy to hit some API that returns JSON to get my data. If reverse engineering seems to hairy, then I will just do headless browser.

I have a really strong preference for hitting JSON apis directly because, well, you get JSON! Also you usually get more data then you even knew existed.

Then again, if I was creating a spider to recursively crawl a non-static website, then I think Headless is the path of least resistance. But usually, I'm trying to get data in the HTML, and not the whole document.

1 comments

I’ve been doing web scraping for the past 5 years and this is exactly the approach I take as well!