Hacker News
by kizer 1168 days ago
I think companies are moving too quickly with AI, and LLMs in particular. I think the data LLMs are trained on should be very well-known - not just sanitized, and certainly not just the "whole web". GPT-4 is unwieldy... it's incredibly powerful but still unpredictable, and it has learned who knows how many "bad patterns", so to speak, that we'll never know about since it's basically a giant black box.

The ChatGPT version is the least harmful in my opinion; more sinister are the problems that propagate when GPT is used under the hood as a component in services (such as Bing search).

1 comment

Nothing is actually trained on the "whole web". It's way too much content for the size of the models that we're dealing with - you can certainly train a model on that, but there's a limit to what it can "learn" based on its size. So in practice everybody is using curated subsets.

It would indeed be much better if we knew exactly what the training data was for every given model. But models will still hallucinate things that aren't directly in that data but could somehow be inferred from it, so that won't solve the problem.