HN Mirror


by hombre_fatal 456 days ago

This is probably about on-demand search, not about gathering training data.

Crawling is more general, and you get to consume the page in its rendered form instead of reconstructing it from the raw dump yourself.

Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.

Just think of how that logic would work. LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it uses this special code path that knows how to use https://{site}/{path} to figure out where {path} is in {site}'s data dump.
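To make that special code path concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `DUMP_HOSTS`, `DUMP_INDEX`, `lookup_in_dump`, and `resolve` are made-up names, the dump is faked as an in-memory dict, and the only URL rule handled is the MediaWiki-style `/wiki/{title}` path. Every other site with a dump would need its own mapping rule, which is the complexity I mean.

```python
from urllib.parse import urlparse, unquote

# Hypothetical: hosts for which the bot happens to have a local data dump.
DUMP_HOSTS = {"en.wikipedia.org"}

# Hypothetical stand-in for a dump index: (host, title) -> article text.
DUMP_INDEX = {
    ("en.wikipedia.org", "Tragedy_of_the_commons"): "dump text for the article",
}


def lookup_in_dump(url):
    """Return dump text for url, or None if the URL is not special-cased."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host not in DUMP_HOSTS:
        return None
    # Assumes MediaWiki-style /wiki/{title} paths; other sites need other rules.
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        return None
    title = unquote(parts.path[len(prefix):])
    return DUMP_INDEX.get((host, title))


def resolve(url):
    """Use the dump when the special case applies, else fall back to crawling."""
    text = lookup_in_dump(url)
    if text is not None:
        return ("dump", text)
    # A real bot would issue an ordinary HTTP GET here.
    return ("crawl", url)
```

Even this toy version needs a per-site path rule and a fallback for anything the rule misses, while the generic crawl branch is a single line that works for every site.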

1 comment

black_puppydog 456 days ago

Yeah. Much easier to tragedy-of-the-commons the hell out of what is arguably one of the only consistently great achievements on the web...
