Show HN: Snitchmd – Cloudflare-protected URLs into clean Markdown via Docker (github.com)
8 points by syabro 48 days ago
Author here. Built this for myself, putting it out in case it's useful.

I needed any URL as clean Markdown for LLM context, including Cloudflare/anti-bot sites. curl gets HTTP 403 on those, raw HTML is 80%+ nav noise that eats context, and paid SaaS (Firecrawl, Jina) wasn't an option for me.

It's a Docker wrapper around two existing OSS tools: CloakBrowser (stealth Chromium that passes Cloudflare) and rs-trafilatura (HTML → Markdown). No new scraper, just glue. It runs locally, so my URLs stay on my box.

Token reduction (raw curl HTML vs snitchmd, tiktoken cl100k_base):

- cloudflare.com/learning/bots — curl: HTTP 403 → snitchmd: 0.8k

- docs.docker.com/engine/install — 187k → 0.9k

- en.wikipedia.org/wiki/LLM — 222.7k → 29.7k
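
For a rough sense of scale, the numbers above can be turned into percentages. A minimal Python sketch, assuming the counts in the list are in thousands of tokens; `reduction_pct` is a made-up helper name for illustration, and the Cloudflare row is skipped because curl got a 403 there, so there is no raw count to compare against:

```python
def reduction_pct(raw_tokens: float, clean_tokens: float) -> float:
    """Percentage of tokens removed going from raw HTML to Markdown."""
    return round(100 * (1 - clean_tokens / raw_tokens), 1)


# Token counts in thousands, copied from the list above.
pages = {
    "docs.docker.com/engine/install": (187.0, 0.9),
    "en.wikipedia.org/wiki/LLM": (222.7, 29.7),
}

# Map each URL to its percentage reduction.
results = {url: reduction_pct(raw, clean) for url, (raw, clean) in pages.items()}

for url, pct in results.items():
    print(f"{url}: {pct}% fewer tokens")
```

That works out to roughly 99.5% fewer tokens for the Docker docs page and 86.7% for the Wikipedia article.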

Heads up: it passes Cloudflare but can't solve "click the traffic lights" captchas (reCAPTCHA v2, hCaptcha).

MIT licensed. Happy to answer questions.

1 comment

What's the difference from Playwright?