HN Mirror

	by MichealCodes 265 days ago
	I really hope benchmarking improves soon so that models are monitored in the weeks following the announcement. It really seems like these companies introduce a new "buffed" model and then slowly nerf its intelligence through optimizations. If we saw task performance in week 1 vs. week 8 on benchmarks, it would at least give us more insight into the loop here. In an environment lacking true progress, a company could surely "show" progress with this strategy.
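A minimal sketch of what a week 1 vs. week 8 comparison could look like, assuming the same fixed set of benchmark tasks is rerun against the same model endpoint and each task is scored pass/fail. The function name and the pass counts below are hypothetical, made up purely for illustration. A two-proportion z-test is one simple way to tell whether a drop is bigger than run-to-run noise, though it is not necessarily how any benchmark maintainer does it.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Compare two pass rates (a = earlier run, b = later run).

    Returns (difference b - a, z statistic, two-sided p-value)
    using the pooled two-proportion z-test.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical numbers, not real measurements:
# 500 tasks per run, 412 passed in week 1 and 371 passed in week 8.
diff, z, p = two_proportion_z(412, 500, 371, 500)
print(f"change in pass rate: {diff:+.3f}, z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest the change is unlikely to be sampling noise alone. It would not say why the change happened, and it only means something if the prompts, sampling settings, and grading are held constant between the two runs.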

SubiculumCode 265 days ago

I do wonder about this. I just don't know if it's real or in our heads.

commakozzi 265 days ago

It does feel like it has to be real. I've noticed it since ChatGPT with GPT-3.5, once it hit big news publicly and demands were made on "censoring" its output to limit biases, etc. (it's not inherently a problem for us as a society to do this with LLMs, but it does affect the output for obvious reasons). Whatever workflow OpenAI and others have applied seems to happen post-release somehow? I'm ignorant and just speculating, but I've noticed it with literally every model release. It starts strong and ends up feeling less capable days, weeks, months after. I'm sure some of it could be in the parallelization of processing that has to occur to service the large number of requests, and more and more traffic is spreading it thin?

MichealCodes 265 days ago

> I'm sure some of it could be in the parallelization of processing that has to occur to service the large number of requests, and more and more traffic is spreading it thin?

Even if this is the case, benchmarks should be run at scale too if the models suffer from symptoms of scale. Otherwise the benchmarks are just a lie unless you have access to an unconstrained version of the model.

beefnugs 265 days ago

Capitalism is a pure scam now on every level: they did this with NVMe drives in the last couple of years, sending out perfect hardware to reviewers and then rug-pulling, shipping trash to the rest of the world.
