by elorant 2530 days ago
Have you tried loading a full browser session? Not just headless.

Not the OP, but I did that about 12 years ago, with Firefox. My boss at the time had asked me to parse some public institution website that was quite difficult to write a parser for directly in Python, so in the end we just decided to write a quick extension for Firefox and let an instance of it run on a spare computer. That public institution website had some JS bug that would cause FF to gobble up memory pretty fast, but we also solved that by automatically restarting FF at certain intervals (or when we noticed something was off).

Not sure if people do this sort of thing nowadays.
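
For anyone curious what the "restart it at certain intervals, or when something looks off" part might look like today, here is a minimal sketch in Python. It is not the original setup (that was a Firefox extension on a spare machine); the names `should_restart`, `supervise`, and the `is_healthy` callback are hypothetical, and what counts as "off" is left to the caller.

```python
import subprocess
import time


def should_restart(started_at, now, max_age_s, healthy):
    """Restart when the process is older than max_age_s or a health check failed."""
    return (now - started_at) >= max_age_s or not healthy


def supervise(cmd, max_age_s, is_healthy, poll_s=5.0, cycles=None):
    """Keep `cmd` running, restarting it on a schedule or when unhealthy.

    `is_healthy` is a caller-supplied function (an assumption of this sketch)
    that takes the Popen object and returns True while things look fine.
    `cycles` caps the number of launches; None means run forever.
    """
    launches = 0
    while cycles is None or launches < cycles:
        proc = subprocess.Popen(cmd)
        launches += 1
        started_at = time.monotonic()
        # Poll until the process exits on its own or is due for a restart.
        while proc.poll() is None:
            if should_restart(started_at, time.monotonic(), max_age_s, is_healthy(proc)):
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    # The process ignored the polite request; force it.
                    proc.kill()
                    proc.wait()
                break
            time.sleep(poll_s)
    return launches
```

Something like `supervise(["firefox"], max_age_s=3600, is_healthy=lambda p: True)` would relaunch the browser roughly every hour; the command and the interval are placeholders, and a real health check (memory use, time since last scraped page) would replace the lambda.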

When I'm doing personal scraping, I just write a Chrome extension. You can find boilerplates that are super easy to set up, and the extension's background script persists between page loads. It's really easy to collect the data and log it to the console or send it to a local API or database. It's the lowest-effort method of scraping I know, and you can monitor it while it runs to make sure it doesn't get hung up on some edge case.
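
The "send it to a local API" half of that can be very small. Below is a rough sketch of a local collector using only the Python standard library; it is one possible shape, not the commenter's actual setup. `CollectorHandler`, `start_collector`, and `post_json` are made-up names, records are only kept in memory, and `post_json` merely stands in for the request the extension would send.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CollectorHandler(BaseHTTPRequestHandler):
    records = []             # scraped items, kept in memory for this sketch
    lock = threading.Lock()  # handlers run in separate threads

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            item = json.loads(self.rfile.read(length))
        except ValueError:
            # Empty or malformed JSON body.
            self.send_response(400)
            self.end_headers()
            return
        with self.lock:
            self.records.append(item)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def start_collector(port=0):
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), CollectorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_json(url, item):
    """Stand-in for the extension: POST one record as JSON, return the status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(item).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req) as resp:
        return resp.status
```

To try it: `server = start_collector()`, then post to `http://127.0.0.1:<port>/` using the port from `server.server_address[1]`. Swapping the in-memory list for a file or database write is the obvious next step.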
Sure we do. Through Selenium. You can either load a full browser session, or a headless one. But headless sessions are identifiable.
Selenium injects predictable JavaScript in both situations.
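
To make "identifiable" a bit more concrete, here is a toy sketch of two signals a site could check. It assumes the site has already collected the User-Agent header and the page's `navigator.webdriver` value; the function name `automation_signals` is hypothetical. Headless Chrome has historically put `HeadlessChrome` in its default user agent, and the WebDriver spec has browsers set `navigator.webdriver` to true while under automation, headless or not. Real detection goes well beyond this, and both signals can be overridden.

```python
def automation_signals(user_agent, navigator_webdriver):
    """Return the automation hints found in one visitor's data.

    user_agent: the User-Agent string the browser sent.
    navigator_webdriver: value of navigator.webdriver as reported by
        a script running in the page (True, False, or None if unknown).
    """
    signals = []
    if "HeadlessChrome" in user_agent:
        # Default UA token of headless Chrome; trivially spoofable.
        signals.append("headless user agent")
    if navigator_webdriver is True:
        # Set while a WebDriver session controls the browser, in full
        # and headless sessions alike.
        signals.append("navigator.webdriver is true")
    return signals
```

A full (non-headless) Selenium session would typically trip only the second check, which matches the point above that both kinds of session can be spotted.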