by oezi 592 days ago
Does it also include logic to download JS-driven sites properly or is this out of scope?
4 comments

It doesn't. For that you would need to run a full headless browser first, extract the HTML (document.body.innerHTML after the page has finished loading can work) and then process the result.
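
For anyone wanting a concrete starting point, here is a rough sketch of that first step in Python with Playwright. The function name `fetch_rendered_html` and the `timeout_ms` parameter are made up for illustration, and waiting for "networkidle" is only an approximation of "finished loading"; some sites will need a more specific wait.

```python
def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load `url` in headless Chromium and return the rendered body HTML.

    Hypothetical helper; assumes `pip install playwright` and
    `playwright install chromium` have already been run.
    """
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" is a heuristic stand-in for "the page has finished loading".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # Same expression as mentioned above, evaluated inside the page.
            return page.evaluate("document.body.innerHTML")
        finally:
            browser.close()


if __name__ == "__main__":
    html = fetch_rendered_html("https://example.com/")
    print(html[:500])
```

The returned string can then be handed to whatever HTML-to-Markdown converter you prefer.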

If you're already running a headless browser you may as well run the conversion in JavaScript though - I use this recipe pretty often with my shot-scraper tool: https://shot-scraper.datasette.io/en/stable/javascript.html#... - adding https://github.com/mixmark-io/turndown to the mix will get you Markdown conversion as well.
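
A minimal sketch of the "convert inside the browser" idea, again in Python with Playwright rather than shot-scraper itself. It assumes Turndown's browser build can be loaded from unpkg at the URL below and that it exposes a global `TurndownService`; `fetch_markdown` and `TURNDOWN_URL` are hypothetical names. Injecting a script this way may be blocked on pages with a strict Content Security Policy.

```python
# Assumed CDN location of Turndown's browser bundle; swap in a pinned or self-hosted copy.
TURNDOWN_URL = "https://unpkg.com/turndown/dist/turndown.js"


def fetch_markdown(url: str, timeout_ms: int = 30_000) -> str:
    """Render `url` in headless Chromium and convert its body to Markdown in-page.

    Hypothetical helper; assumes Playwright and a Chromium build are installed.
    """
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # Inject Turndown into the loaded page.
            page.add_script_tag(url=TURNDOWN_URL)
            # Run the conversion in the page's own JavaScript context.
            return page.evaluate(
                "new TurndownService().turndown(document.body.innerHTML)"
            )
        finally:
            browser.close()


if __name__ == "__main__":
    print(fetch_markdown("https://example.com/"))
```

Doing the conversion in the page avoids shipping the full HTML back to Python first, at the cost of depending on the injected script loading successfully.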

We do that with Urlbox’s markdown feature: https://urlbox.com/extracting-text

That is unfortunately out of scope. I like the philosophy of doing one thing really well.

But nowadays, with Playwright and Puppeteer, there are great choices for browser automation.