by simonw 269 days ago
Converting HTML into Markdown isn't particularly hard. Two methods I use:

1. The Jina reader API - https://jina.ai/reader/ - prefix any URL with r.jina.ai/ to run it through their hosted conversion proxy, e.g. https://r.jina.ai/www.skeptrune.com/posts/use-the-accept-hea...

2. Applying Readability.js and Turndown via Playwright. Here's a shell script that does that using my https://shot-scraper.datasette.io tool: https://gist.github.com/simonw/82e9c5da3f288a8cf83fb53b39bb4...
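
For the first method, a minimal Python sketch of the URL-prefix trick might look like the following. The function names, the User-Agent string, and the timeout are hypothetical placeholders, and it assumes the proxy returns the converted text as the plain response body; check the Jina docs for rate limits and options before relying on it.

```python
from urllib.request import Request, urlopen

# Prefix described in the comment above; everything else here is illustrative.
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(target: str) -> str:
    """Build the reader URL by putting the r.jina.ai prefix in front of the target."""
    return JINA_READER_PREFIX + target.strip()


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the converted page. Needs network access; assumes a text response body."""
    request = Request(
        jina_reader_url(target),
        headers={"User-Agent": "reader-sketch/0.1"},  # hypothetical client name
    )
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_markdown("https://example.com")` would return whatever the proxy sends back for that page. Nothing runs on import, so the URL builder can be checked without touching the network.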

2 comments

Through my work simplifying Firecrawl[2], I learned that the Go CLI[1] is the best option. In this case, however, I used a converter available on npm so that it would work with `npx` in the Cloudflare Worker builds.

[1]: https://github.com/JohannesKaufmann/html-to-markdown

[2]: https://github.com/devflowinc/firecrawl-simple

Playwright starts a full browser instance. A lightweight alternative is to use an HTML parser and DOM implementation such as linkedom.

This is much cheaper to run on a server. For example: https://github.com/ozanmakes/scrapedown
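
To make the browser-free idea concrete, here is a rough sketch in Python. It is not linkedom or Turndown: it swaps in the standard library's `html.parser` and handles only a small subset of tags (headings, paragraphs, links, bold/italic, list items), with no Readability-style content extraction. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a small subset of HTML to Markdown without starting a browser."""

    SKIP = {"script", "style"}            # content of these tags is dropped
    BLOCK = {"p", "ul", "ol"}             # surrounded by blank lines
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.skip_depth = 0
        self.hrefs: list[str] = []        # stack of pending link targets

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in self.BLOCK:
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.HEADINGS or tag in self.BLOCK:
            self.out.append("\n\n")
        elif tag == "a" and self.hrefs:
            self.out.append("](" + self.hrefs.pop() + ")")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse whitespace runs but keep the spaces between inline elements.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # allow at most one blank line
    return text.strip()


sample = '<h1>Title</h1><p>Hello <b>world</b>, see <a href="https://example.com">the docs</a>.</p>'
print(html_to_markdown(sample))
```

Because this is only string processing over a parsed tag stream, it avoids the memory and startup cost of a browser, but it will not see content that a page renders with JavaScript.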