by ThunderBee 802 days ago
The most surprising thing to me is that Opus is only slightly in the lead.

I was feeding multiple Python and C# coding challenges/questions to both, and Opus blew GPT4 out of the water on every single task. It didn't matter if I was giving them 50 lines or 5,000: Opus would consistently give working, correct solutions, while GPT4 preferred to answer with pseudocode, half-complete code with 'do the thing here' comments, or would just tell me that it's too complicated.

3 comments

Another data point: I definitely find Opus better for coding, but not by much. The problems I give them are generally short (<= 100 lines) and well-defined, so any advantage Opus has in larger contexts won't be apparent to me. They're also generally novel problems but NOT particularly challenging (anyone with a BS in CS should be able to solve them in < 1hr).

I have them working mostly with C++ and Clojure, a bit of Python, and Vimscript every once in a while. Both models are much better at Python and fairly bad at Vimscript. Clojure failure cases are mostly from invented functions and being bad at modifying existing code. I can't pick out a strong pattern in how they fail with C++, but there have been a few times where GPT4 ends up looping between the same couple of unworkable solutions (maybe this indicates a poor understanding of prior context?).

Spot on. People need to say what they are actually using the models for and not just "coding".

I mostly use it to make React/JavaScript front ends to a Python/FastAPI backend, and ChatGPT4 is great at that.

I tried to write a piece of music in the old Csound programming language, though, and the result barely even works.

It will be interesting to see how the context plays out, because I have noticed that I can often give it extra context that I think will be helpful but that ends up causing it to go down a wrong path. I might even say my best results have come from the most precise instructions inside the smallest possible context.

It's because LMSYS is an aggregate Elo across a range of different tasks. Individually, in some very important areas, Claude Opus may be better than GPT-4 by 50-100 Elo points, which is quite a lot. However, there are specific domains where GPT-4 has the advantage because it's been fine-tuned based on a lot of existing usage. So weak points around logic puzzles or specific instructions don't bring down its Elo, whereas Claude Opus doesn't have this advantage yet. I believe Opus's eventual Elo, after all these little areas of weakness are fine-tuned, will be something like 1300.
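
For a sense of scale on that 50-100 point gap, here is a rough sketch of what it means in head-to-head terms. It assumes the standard Elo logistic curve (base 10, 400-point scale); the function name `expected_win_rate` is just made up for illustration, and the leaderboard's exact fitting procedure may differ.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected share of head-to-head wins for the higher-rated model.

    Assumption: standard Elo logistic curve (base 10, 400-point scale).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gaps discussed above: none, 50 points, 100 points.
for gap in (0, 50, 100):
    print(f"{gap:>3} Elo gap -> {expected_win_rate(gap):.0%} expected win rate")
```

Under that assumption, a 50-point lead works out to winning roughly 57% of pairwise votes and a 100-point lead to roughly 64%, so a model can be clearly preferred in one area and still look close once everything is averaged together.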

Yeah, GPT is incredibly lazy. Ironically, 3 is far better at not being lazy than 4.

I guess you benchmarked via the API? I've heard even the datestamped models have been nerfed from time to time.