HN Mirror



	by ThunderBee 848 days ago
	The most surprising thing to me is that Opus is only slightly in the lead. I was feeding multiple Python and C# coding challenges/questions to both, and Opus blew GPT-4 out of the water on every single task. It didn’t matter whether I was giving them 50 lines or 5,000: Opus would consistently give working, correct solutions, while GPT-4 preferred to answer with pseudocode, half-complete code with ‘do the thing here’ comments, or would just tell me that it’s too complicated.

3 comments

happypumpkin 848 days ago

Another data point: I definitely find Opus better for coding, but not by much. The problems I give them are generally short (<= 100 lines) and well-defined, so any advantage Opus has in larger contexts won't be apparent to me. They're also generally novel problems but NOT particularly challenging (anyone with a BS in CS should be able to solve them in < 1 hr).

I have them working mostly with C++ and Clojure, a bit of Python, and Vimscript every once in a while. Both models are much better at Python and fairly bad at Vimscript. Clojure failure cases mostly come from invented functions and from being bad at modifying existing code. I can't pick out a strong pattern in how they fail with C++, but there have been a few times where GPT-4 ended up looping between the same couple of unworkable solutions (maybe this indicates a poor understanding of prior context?).


xetsilon 848 days ago

Spot on. People need to say what they are actually using the models for and not just "coding".

I mostly use it to make React/JavaScript front ends for a Python/FastAPI backend, and ChatGPT-4 is great at that.

I tried to write a piece of music in the old Csound programming language, though, and it barely even works.

It will be interesting to see how the context plays out, because I have noticed that I can often give it extra context that I think will be helpful but that ends up causing it to go down a wrong path. I might even say my best results have come from the most precise instructions inside the smallest possible context.


theaussiestew 848 days ago

It's because the LMSYS score is an aggregate Elo across a range of different tasks. Individually, in some very important areas, Claude Opus may be better than GPT-4 by 50-100 Elo points, which is quite a lot. However, there are specific domains where GPT-4 has the advantage because it's been fine-tuned on a lot of existing usage. So weak points around logic puzzles or specific instructions don't bring down its Elo, whereas Claude Opus doesn't have this advantage yet. I believe Opus's eventual Elo, after all these little areas of weakness are fine-tuned, will be something like 1300.
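For a sense of scale, here is a rough sketch of what a 50-100 point gap means as a head-to-head win rate. It assumes the textbook logistic Elo formula with a 400-point scale; the function name `expected_score` is made up for illustration, and the leaderboard's actual rating computation may differ in its details.

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo model.

    Assumption: classic logistic Elo curve with base 10 and a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # Compare no gap with the 50 and 100 point gaps mentioned above.
    for gap in (0, 50, 100):
        print(f"{gap:>3} points -> {expected_score(gap):.1%}")
```

Under that assumption, a 50-point lead works out to winning roughly 57% of pairwise comparisons and a 100-point lead to roughly 64%.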


dontupvoteme 848 days ago

Yeah, GPT is incredibly lazy. Ironically, 3 is far better at not being lazy than 4.

I guess you benchmarked via the API? I've heard even the datestamped models have been nerfed from time to time.
