by theshrike79
23 hours ago
IMO comparing different models is like comparing songs or paintings or modern art. There is no true objective measure. Can you mathematically determine which song is the best for everyone, for example? Or which painting different people find the nicest to look at, or what emotion it gives them? Yeah, you can do the fucking strawberry tests or carwash trick questions, but those don't really measure anything useful. You can also run benchmarks, but how do you measure the output of those? The easiest way is to just use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third, and Gemini exists: it's pretty good at finessing web UIs, as it does aria labels and usability better than the others, but I wouldn't write backend code with it.
|
I don't think this is that subjective or vague.
There are a couple of crisp metrics that can be used to evaluate a model:
- given a prompt, does it finish the task (repeated across X tasks)?
- how much did it cost to finish the task?
- how long did it take?
If all models are able to handle a class of tasks, they perform equally well on that class.
If a model costs much more to finish a task, it is worse than other models.
If a model takes longer to finish a task, it is worse than other models.
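For what it's worth, here is a rough sketch of how those three metrics could be tallied per model. Everything in it is an assumption for illustration: the `Run` record, the `summarize` helper, the `model_a`/`model_b` labels and the numbers are all made up, and charging failed runs to the finished tasks is just one reasonable way to count cost.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    model: str       # hypothetical model label
    task: str        # hypothetical task id
    finished: bool   # did the model finish the task?
    cost_usd: float  # what this run cost
    seconds: float   # wall-clock time for this run


def summarize(runs):
    """Per model: completion rate, cost per finished task, mean seconds per run."""
    by_model = defaultdict(list)
    for r in runs:
        by_model[r.model].append(r)

    summary = {}
    for model, rs in by_model.items():
        done = sum(r.finished for r in rs)
        total_cost = sum(r.cost_usd for r in rs)
        summary[model] = {
            "completion_rate": done / len(rs),
            # Failed runs still cost money, so they are charged to the finished tasks.
            "cost_per_finished_task": total_cost / done if done else float("inf"),
            "mean_seconds": sum(r.seconds for r in rs) / len(rs),
        }
    return summary


# Made-up numbers, purely to show the shape of the comparison.
runs = [
    Run("model_a", "t1", True, 0.10, 30.0),
    Run("model_a", "t2", True, 0.30, 50.0),
    Run("model_b", "t1", True, 0.50, 20.0),
    Run("model_b", "t2", False, 0.50, 40.0),
]

summary = summarize(runs)
for model, stats in summary.items():
    print(model, stats)
```

With a table like that, the three rules above reduce to comparing columns: a higher completion rate wins first, then lower cost per finished task, then lower time.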
The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows got bigger, and reasoning steps make a user's prompt more useful... That's it. Even those are UX improvements rather than huge breakthroughs.