by EugeneOZ
95 days ago
Not in my experience.
Quoting my tweet: Gave the same prompt to GPT 5.4 (high) and Opus 4.6 (high). GPT 5.4 implemented the feature, refactored the code (was not asked to), removed comments that were not added in that session, made the code less readable, and introduced a bug. "Undo All". Opus 4.6 correctly recognized that the feature is already implemented in the current code (yeah, lol) and proposed implementing tests and updating the docs. Opus 4.6 is still the best coding agent. So yeah, GPT 5.4 (high) didn't even check if the feature was already implemented. Tried other tasks, tried "medium" reasoning - disappointment.
|
I am not sure one can really extrapolate much from that, but I do find it interesting nonetheless.
I think language is also an important factor. I have a hard time deciding which of the two LLMs is worse at Swift, for example. They both seem equally great and awful in different ways.