by hombre_fatal
410 days ago
This is probably about on-demand search, not about gathering training data. Crawling is more general, and you get to consume the content in its reconstituted form instead of deriving it yourself from a dump. Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search. Just think of how that logic would work. The LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it takes a special code path that knows how to turn https://{site}/{path} into the location of {path} in {site}'s data dump.
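To make that concrete, here is a rough sketch of what the branching might look like. Everything in it is hypothetical: the `DUMP_SITES` table, the dump location, and the `plan_fetch` name are made up for illustration, and the hard part (actually resolving a path inside a dump and rendering it) is only hinted at in a comment.

```python
from urllib.parse import urlparse

# Hypothetical table of special-cased hosts and where their dump lives locally.
DUMP_SITES = {
    "en.wikipedia.org": "/dumps/enwiki",
}

def plan_fetch(url: str) -> tuple[str, str]:
    """Return ("dump", locator) for special-cased sites, else ("crawl", url)."""
    parts = urlparse(url)
    dump_root = DUMP_SITES.get(parts.netloc)
    if dump_root is None:
        # Generic path: the same code works for every site on the web.
        return ("crawl", url)
    # Special path: someone now has to maintain per-site knowledge of how
    # {path} maps into the dump (title encoding, redirects, rendering, ...).
    return ("dump", f"{dump_root}:{parts.path}")
```

The generic branch is one line and covers everything; the special branch needs an entry, a dump, and mapping logic per site, which is the extra complexity I mean.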