|
|
|
|
|
by h4ny
307 days ago
|
|
I have been seeing different people reporting different results with different tasks. Watched a live stream that compared GPT-5, Gemini Pro 2.5, Claude 4 Sonnet, and GLM 4.5, and GPT-5 appeared to not follow instructions as well as the other three. At the moment it feels like most people "reviewing" models depends on their believes and agenda, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed). The blurring boundaries between technical overview, news, opinions and marketing is truly concerning. |
|
You are not going to get the same output from GPT5 or Sonnet every time.
And this obviously compounds across many different steps.
E.g. give GPT5 the code to a feature (by pointing some files and tests) and tell it to review it and find improvement opportunities and write them down: depending on the size of the code, etc, the answers will slightly different.
I often do it in Cursor by having multiple agents review a PR and each of them: - has to write down their pr-number-review-model.md (e.g. pr-15-review-sonnet4.md) - has to review the reviews of the other files
Then I review it myself and try to decide what's valuable in there and what not. And to my disappointment (towards myself): - often they do point to valid flaws I would've not thought about - miss the "end-to-end" or general view of how the code fits in a program/process/business. What do I mean: sometimes the real feedback would be that we don't need it at all. But you need to have these conversations with AI earlier.