by timthelion 936 days ago
I think we need a new type of status page, or at least a public version number, for LLMs. Yesterday GPT4 started giving me nonsense, super-generic answers, as if it was hardly reading what I wrote, and today it is back to top-notch performance. I think they were trying to make the model more efficient or something, but I just saw a massive decrease in the quality of output. From my side, though, there is no version number except for "4"...
3 comments

In a recent interview (Hard Fork podcast), Sam Altman mentioned that, due to the load, they have been trying to make optimizations, disable certain features, etc., so it's not outside the realm of possibility that some tweak caused this.
I think one of the harder things about developing these models is that regressions are hard to figure out or even detect.
That's a good point. I would be curious to understand more about what the testing setup is like for these kinds of systems.
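For illustration only, here is a minimal sketch of how a regression between two model revisions could be flagged. It assumes each revision has already been scored on a fixed set of tasks. The function name `find_regressions`, the task names, the scores, and the `tolerance` threshold are all hypothetical, and nothing here describes any vendor's actual setup.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return tasks whose score dropped by more than `tolerance`.

    `baseline` and `candidate` map task name -> score in [0, 1].
    Both the scoring scheme and the threshold are assumptions.
    """
    regressions = {}
    for task, old_score in baseline.items():
        new_score = candidate.get(task)
        if new_score is None:
            # Task was not run on the candidate revision; skip it.
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[task] = round(drop, 4)
    return regressions


# Made-up scores for two hypothetical revisions.
baseline = {"summarize": 0.90, "code": 0.80}
candidate = {"summarize": 0.91, "code": 0.70}
print(find_regressions(baseline, candidate))
```

The hard part the comment above points at is choosing tasks and scores that actually capture what users notice. A check like this only catches drops on whatever is being measured.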
I experienced the same with the API: it simply ignored all system messages. These things should be clearly communicated beforehand.
This conspiracy always comes up - don't you think that they test the output of the model revisions on probably 1000s of downstream tasks at this point? Bad responses are hard to reason about, could be prompting, could be a model revision, could just be bad luck.
Or maybe they are just AB testing and aggressively optimizing the response generation?

LLMs are known to be compute/energy hungry to execute. It is a developing technology, if not downright experimental.

Therefore, this explanation is very likely. I cannot see the reason to call this a conspiracy.

AB testing on what? AB tests need to produce some results which are then compared. How would releasing different versions in production help with that?

It would make more sense if that was internal and the responses were then graded.

A failed canary release would be more likely, where they released this version to a small number of people without realising it was bad.

Off the top of my head: responses have feedback buttons below them.

You can simply deploy different versions and compare the (neutral + positive) / negative feedback ratio.

It would be sinful if they did not add other metrics like how many times the user had to correct and update their prompt before ending the chat, etc.

Data, data, data...
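As a rough sketch of the comparison described above: compute the (neutral + positive) / negative ratio per deployed version, then use a standard two-proportion z-test to check whether the difference in negative-feedback rates is more than noise. The function names and all counts are hypothetical, and the z-test is my own addition, not something the thread says is used.

```python
import math


def feedback_ratio(positive, neutral, negative):
    """(neutral + positive) / negative for one deployed version."""
    if negative == 0:
        return math.inf
    return (neutral + positive) / negative


def negative_rate_z(neg_a, total_a, neg_b, total_b):
    """Two-proportion z-statistic for the negative-feedback rate.

    Positive values mean version B received proportionally more
    negative feedback than version A.
    """
    rate_a = neg_a / total_a
    rate_b = neg_b / total_b
    pooled = (neg_a + neg_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # No variation at all (all or no feedback negative in both groups).
        return 0.0
    return (rate_b - rate_a) / std_err


# Made-up feedback counts for two hypothetical versions.
ratio_a = feedback_ratio(positive=80, neutral=10, negative=10)
ratio_b = feedback_ratio(positive=60, neutral=10, negative=30)
z = negative_rate_z(10, 100, 30, 100)
print(ratio_a, ratio_b, z)
```

An absolute z-value above roughly 1.96 is the usual 5% significance cutoff. With small samples or self-selected feedback clicks, that threshold should be read with caution.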

There are the thumbs up/down buttons and automatic sentiment analysis as a test.
Calling that a conspiracy is like saying it's a conspiracy theory that Meta shows different people different ads. I'd be more concerned if OpenAI WASN'T constantly trying to tune their models. It's literally their job to tune the models.