by _verandaguy 62 days ago
This still doesn't really answer my question, though. This is like telling me my old blog posts can't be parsed by your regex.

Like... yeah, no shit; I didn't build it for your regex. It's not the target audience.

Plus, isn't the appeal of LLMs broadly that they can do somewhat-useful things with mostly-arbitrary input (if you ignore the risk of prompt injection)?

1 comment

> Plus, isn't the appeal of LLMs broadly that they can do somewhat-useful things with mostly-arbitrary input (if you ignore the risk of prompt injection)?

They can definitely read HTML, but they do better with more structure. For example, I proposed in a sibling comment that the "reader mode" feature in browsers might be a great LLM-compatibility feature for cutting down all the HTML token noise. Another option is exposing an HTTP API with an OpenAPI schema, a proper sitemap, and an RSS feed. Fetching from an RSS feed, for instance, can be exposed to the LLM as a "tool" that it can call.
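To make the RSS-as-a-tool idea concrete, here's a rough sketch. The tool name `fetch_feed`, its schema, and the sample feed are all made up for illustration; the schema follows the JSON-schema style that function-calling APIs tend to accept, but it isn't tied to any particular vendor. A real version would download the feed over HTTP; this one only parses XML it is handed, so it runs offline.

```python
from itertools import islice
import xml.etree.ElementTree as ET

# Hypothetical tool description the LLM would see (name and fields are assumptions).
FETCH_FEED_TOOL = {
    "name": "fetch_feed",
    "description": "Return the latest posts from the site's RSS feed.",
    "parameters": {
        "type": "object",
        "properties": {"limit": {"type": "integer", "minimum": 1}},
        "required": [],
    },
}


def parse_rss(xml_text, limit=10):
    """Turn RSS 2.0 XML into a compact list of dicts, at most `limit` long."""
    root = ET.fromstring(xml_text)
    posts = []
    # Only <item> elements are posts; the channel's own <title> is skipped.
    for item in islice(root.iter("item"), limit):
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return posts


# Made-up feed standing in for the HTTP response body.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>First post</title><link>https://example.com/1</link><description>Hello</description></item>
  <item><title>Second post</title><link>https://example.com/2</link><description>World</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for post in parse_rss(SAMPLE_FEED, limit=2):
        print(post["title"], post["link"])
```

The point is that the model gets a few short fields per post instead of a full page of markup, which is where the token savings come from.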

I don't think it's fair to say that HTML's less structured than Markdown. Markdown essentially maps onto a simplified subset of HTML, and I cut my teeth on HTML5 when it was still new: there's been a huge emphasis on the idea of the semantic web conveyed through HTML.