by kure256 216 days ago
A small additional note for context:

I’m not arguing that “LLMs will replace browsing” in some absolute way, but for many users the entry point for information is visibly shifting from search to assistants. When you actually inspect how models consume real websites today, the results are pretty uneven:

- pages with clean HTML and predictable structure get parsed reliably
- JSON-LD is used surprisingly often (but only if it’s correct and minimal)
- heavy client-side rendering breaks extraction more than people expect
- semantic markup still beats any “AI-enabled” tool by a mile
- models hallucinate less when the source has clear hierarchy and meaning
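
To make the observations above a bit more concrete, here is a rough sketch of the kind of check I mean. It uses only the Python standard library and looks at the raw HTML as served, on the assumption that the consumer does not execute JavaScript. The names `StructureProbe` and `probe` are made up for illustration, and the signals it reports (h1 count, skipped heading levels, parseable vs broken JSON-LD blocks) are my own guess at useful proxies, not a description of how any particular model ingests pages.

```python
import json
from html.parser import HTMLParser


class StructureProbe(HTMLParser):
    """Collects headings and JSON-LD blocks from raw (un-rendered) HTML."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []      # list of (level, text) in document order
        self.jsonld = []        # JSON-LD blocks that parsed successfully
        self.jsonld_errors = 0  # JSON-LD blocks that failed to parse
        self._heading = None    # tag name of the currently open heading
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading, self._buf = tag, []
        elif tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._in_jsonld, self._buf = True, []

    def handle_data(self, data):
        if self._heading or self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._heading and tag == self._heading:
            text = "".join(self._buf).strip()
            self.headings.append((int(tag[1]), text))
            self._heading = None
        elif tag == "script" and self._in_jsonld:
            try:
                self.jsonld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                self.jsonld_errors += 1
            self._in_jsonld = False


def probe(html: str) -> dict:
    """Summarise a few structural signals of a page (hypothetical metric set)."""
    parser = StructureProbe()
    parser.feed(html)
    parser.close()
    levels = [level for level, _ in parser.headings]
    # A "skip" is a jump of more than one level down, e.g. h2 followed by h4.
    skips = sum(1 for a, b in zip(levels, levels[1:]) if b > a + 1)
    return {
        "h1_count": levels.count(1),
        "heading_skips": skips,
        "jsonld_ok": len(parser.jsonld),
        "jsonld_broken": parser.jsonld_errors,
    }


sample = """<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}</script>
<script type="application/ld+json">{"@type": "Article",}</script>
</head><body><h1>Hello</h1><h2>Intro</h2><h4>Detail</h4></body></html>"""

print(probe(sample))
```

On the sample, the second JSON-LD block has a trailing comma and gets counted as broken, and the h2 to h4 jump counts as one skipped level. A page that renders everything client-side would simply come back with zeros here, which is roughly what a non-rendering fetcher would see.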

This project isn’t trying to reinvent SEO. It’s more about exploring the minimum structural guarantees that make an LLM treat a page as a trustworthy, citable source instead of ignoring or misreading it.

If anyone here has done experiments with:

- how GPT, Claude, Gemini, Llama, etc. read arbitrary web pages
- failure cases in parsing / hallucination caused by layout
- the effect of metadata vs full-text signal
- or even prompt strategies for web ingestion

…I’d genuinely love to compare notes.