| HN Mirror


by theshrike79 23 hours ago

IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure. Can you mathematically determine which song is the best for everyone, for example? Or which painting different people feel is the nicest to look at, or what emotion it gives them?

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs, as it does aria labels and usability better than the others, but I wouldn't write backend code with it.


locknitpicker 23 hours ago

> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
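As a rough sketch of how those three metrics could be turned into a comparison, the snippet below aggregates per-task runs and ranks models by completion rate, then cost, then time. The `Run` record, its field names, and the tie-breaking order (cost before time) are assumptions made for illustration, not something stated in the comment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One attempt by one model at one task (hypothetical record layout)."""
    model: str
    finished: bool    # did the output meet the task's definition of done?
    cost_usd: float   # what the attempt cost
    seconds: float    # wall-clock time for the attempt


def summarize(runs):
    """Return {model: (completion_rate, mean_cost_usd, mean_seconds)}."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    return {
        model: (
            mean(1.0 if r.finished else 0.0 for r in rs),
            mean(r.cost_usd for r in rs),
            mean(r.seconds for r in rs),
        )
        for model, rs in by_model.items()
    }


def rank(runs):
    """Order models best-first: higher completion rate, then cheaper, then faster.

    Assumption: cost breaks ties before time; swap the key order if latency
    matters more to you than price.
    """
    summary = summarize(runs)
    return sorted(
        summary,
        key=lambda m: (-summary[m][0], summary[m][1], summary[m][2]),
    )
```

Note that the averages here include failed attempts, so a model that fails cheaply and quickly can look good on cost and time; that is why completion rate is compared first.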

The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, and reasoning steps help improve the usefulness of a user's prompt. That's it. Even those are UX improvements rather than huge breakthroughs.


theshrike79 18 hours ago

"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?


locknitpicker 1 hour ago

> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked, then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as Fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.
