| > IMO comparing different models is like comparing songs or paintings or modern art. I don't think this is that subjective or vague. There are a few crisp metrics that can be used to evaluate a model: - given a prompt, does it finish the task (times X tasks) - how much did it cost to finish the task - how long did it take? If all models can handle a class of tasks, they perform equally well on that class. If a model costs much more to finish a task, it is worse than the other models. If a model takes longer to finish a task, it is worse than the other models. The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, and reasoning steps help improve the usefulness of a user's prompt... That's it. Even those are UX improvements rather than huge breakthroughs. |
Or just that it's so much cheaper that the cost/benefit ratio is better?
Also, "finish a task" is subjective too. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result, which is subjective again?
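To make the disagreement concrete, here is a rough sketch of what tracking those three metrics (completion, cost, time) alongside a quality score might look like. Everything in it is an assumption for illustration: the `Run` record, the `summarize` helper, the model names, and the toy numbers are all made up, and the `quality` field stands in for whatever rubric or human grader you would actually use, which is exactly the subjective part.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt by one model at one task (hypothetical record)."""
    model: str
    completed: bool
    cost_usd: float
    seconds: float
    quality: float  # 0..1 from some grader; assumed, and subjective


def summarize(runs):
    """Aggregate per-model completion rate, cost, time, and quality."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run)

    summary = {}
    for model, model_runs in by_model.items():
        done = [r for r in model_runs if r.completed]
        # Failed runs still cost money, so charge them to the completed ones.
        total_cost = sum(r.cost_usd for r in model_runs)
        summary[model] = {
            "completion_rate": len(done) / len(model_runs),
            "cost_per_completed": total_cost / len(done) if done else float("inf"),
            "mean_seconds": sum(r.seconds for r in model_runs) / len(model_runs),
            "mean_quality": sum(r.quality for r in done) / len(done) if done else 0.0,
        }
    return summary


# Toy data, invented purely to show the shape of the comparison.
runs = [
    Run("model_a", True, 0.40, 30.0, 0.9),
    Run("model_a", True, 0.60, 50.0, 0.7),
    Run("model_b", True, 0.10, 20.0, 0.5),
    Run("model_b", False, 0.10, 40.0, 0.0),
]

for model, stats in summarize(runs).items():
    print(model, stats)
```

In this made-up data, `model_b` is cheaper per completed task and faster, while `model_a` finishes more tasks with higher graded quality. Which one is "better" depends on how you weight those columns, so the crisp metrics still leave a judgment call.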