|
|
|
|
|
by danielfalbo
297 days ago
|
|
How to reliably HTML to MD for any page on the internet?
I remember struggling with this in the past How hard would it be to build an MCP that's basically a proxy for web search except it always tries to build the markdown version of the web pages instead of passing HTML? Basically Sosumi.ai but instead of working on only for Apple docs it works for any web page (including every doc on the internet) |
|
In many cases, a Markdown distillation of HTML can improve the signal-to-noise ratio — especially for sites mired in <div> tag soup (intentionally or not). But that's an optimization for token efficiency; LLMs can usually figure things out.
The motivation behind Sosumi is better understood as a matter of accessibility. The way AI assistants typically fetch content from the web precluded them from getting any useful information from developer.apple.com.
You could start to solve the generalized problem with an MCP that 1) used a headless browser to access content for sites that require JS, and 2) used sampling (i.e. having a tool use the host LLM) to summarize / distill HTML.