


	by int_19h 1167 days ago
	Nothing is actually trained on the "whole web". It's way too much content for the size of the models we're dealing with: you can certainly train on all of it, but there's a limit to how much a model can "learn" given its size. So in practice everybody uses curated subsets. It would indeed be much better if we knew exactly what the training data was for any given model. But models will still hallucinate things that aren't directly in that data but could somehow be inferred from it, so knowing the data won't solve the problem.