HN Mirror

	by oezi 592 days ago
	Does it also include logic to download JS-driven sites properly or is this out of scope?

4 comments

simonw 592 days ago

It doesn't. For that you would need to execute a full headless browser first, extract the HTML (document.body.innerHTML after the page has finished loading can work) and process the result.

If you're already running a headless browser, you may as well run the conversion in JavaScript, though. I use this recipe pretty often with my shot-scraper tool: https://shot-scraper.datasette.io/en/stable/javascript.html#... Adding https://github.com/mixmark-io/turndown to the mix will get you Markdown conversion as well.
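
For reference, the first approach above (load the page in a headless browser, wait for it to finish, then read document.body.innerHTML) could look roughly like the sketch below. It assumes Playwright for Python, which is one of several options and not necessarily what anyone in this thread uses. The function name fetch_rendered_html, the "networkidle" wait condition, and the 30-second timeout are illustrative choices, not part of any tool mentioned here.

```python
# Expression evaluated in the page once it has loaded (taken from the comment above).
EXTRACT_BODY_JS = "document.body.innerHTML"


def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Return the body HTML of `url` after its JavaScript has run.

    Hypothetical helper: assumes Playwright and a Chromium build are installed.
    """
    # Imported lazily so this module can be imported without Playwright present.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" is a heuristic for "finished loading"; some sites
            # may need a more specific wait, such as waiting for a selector.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.evaluate(EXTRACT_BODY_JS)
        finally:
            browser.close()
```

After `pip install playwright` and `playwright install chromium`, calling fetch_rendered_html("https://example.com") should return the rendered body markup as a string, which can then be handed to whatever HTML-to-Markdown converter you prefer.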

jot 592 days ago

We do that with Urlbox’s markdown feature: https://urlbox.com/extracting-text

JohannesKauf 592 days ago

That is unfortunately out of scope. I like the philosophy of doing one thing really well.

But nowadays, with Playwright and Puppeteer, there are great choices for browser automation.

bni 592 days ago

I used https://github.com/mozilla/readability for this.
