

by ACCount37 279 days ago

"Anecdata" is notoriously unreliable when it comes to estimating AI performance over time.

Sure, people complain about Anthropic's AI models getting worse over time, and about OpenAI's models too. But guess what? If you serve them open weights models, they also complain about models getting worse over time. Same exact checkpoint, same exact settings, same exact hardware.

Relative LMArena metrics, however, are fairly consistent across time.

The takeaway is that users are not reliable LLM evaluators.

My hypothesis is that users have a "learning curve" and get better at spotting LLM mistakes over time, both overall and for a specific model checkpoint. The result is increasingly critical evaluations over time, even when the model itself has not changed.

3 comments

ryoshu 279 days ago

Selection bias + perceptual adaptation is my experience. Selection bias happens when we play the probabilities of using an LLM and only focus on the things it does really well, because it can be really amazing. When you use a model a lot, you increasingly notice where it doesn't work well, and your perception shifts to focus on what doesn't work vs. what does.

Living evals can solve for the quantitative issues with infra and model updates, but not sure how to deal with perceptual adaptation.
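A rough sketch of what the quantitative half of a "living eval" could look like: re-run the same fixed prompt suite on a schedule and check whether the pass rate dropped by more than noise would explain. The function names, the one-sided two-proportion z-test, and the 1.645 threshold are illustrative assumptions here, not anything a provider or LMArena is known to use.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z statistic for the pass-rate difference between two runs (A minus B)
    of the same fixed eval suite. Hypothetical helper for illustration."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that nothing changed.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no measurable difference.
        return 0.0
    return (p_a - p_b) / se


def regressed(pass_a: int, n_a: int, pass_b: int, n_b: int,
              z_threshold: float = 1.645) -> bool:
    """True if run B's pass rate is significantly lower than run A's
    (one-sided test, roughly the 5% level with the default threshold)."""
    return two_proportion_z(pass_a, n_a, pass_b, n_b) > z_threshold


if __name__ == "__main__":
    # Assumed example numbers: baseline run vs. a later run on the same 100 prompts.
    print(regressed(90, 100, 70, 100))
    print(regressed(90, 100, 88, 100))
```

This only catches changes on the serving side (infra, quantization, model updates). It says nothing about whether the person reading the outputs has become pickier, which is the perceptual adaptation part.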


gowld 279 days ago

And survivor bias.

People who like the tool at first use it until they stop liking it -> "it got worse"

People who dislike the tool at first do not use it -> "it was bad"


rapind 279 days ago

And yet, people's complaints about Claude Code over the past month and a bit are now justified by Anthropic stating that those complaints caused them to investigate and fix a bunch of issues (while still investigating potentially more issues with Opus).

> But guess what? If you serve them open weights models, they also complain about models getting worse over time.

Isn't this also anecdotal, or is there data informing this statement?

I think you could be partially right, but I don't think dismissing criticism as just a change in perspective is correct either. At least some complaints are from power users who can usually tell when something is getting objectively worse (as was the case for some of us Claude Code users recently). I'm not saying we can't fool ourselves too, but I don't think that's the most likely assumption to make.


yazanobeidi 279 days ago

You’re not wrong, but I can literally see it get worse throughout the day sometimes, especially recently, coinciding with Pacific Time Zone business hours.

Quantization could be done, not to deliberately make the model worse, but to increase reliability! Like Apple throttling devices - they were just trying to save your battery! After all, there are regular outages, including some pretty major ones a handful of weeks back that took e.g. Opus offline for an entire afternoon.
