by bottlepalm 724 days ago
It'd be interesting to see how Sonnet 3.5 does at this. I've found Sonnet a step change better than Opus, and for a fraction of the cost. Opus for me is already far better than GPT-4. And same as the poster found, GPT-4o is plain worse at reasoning.

Edit: Better at chain of thought, long running agentic tasks, following rigid directions.

2 comments

That's an interesting question - I'll take a few pokes at it now to see if there's improvement.
Update: Sonnet 3.5 is better than any other model for the circuit design and part finding tasks. Going to iterate a bit on the prompts to see how much I can push the new model on performance.

Figures that any article written on LLM limits is immediately out of date. I'll write an update piece to summarize new findings.

That name threw me for a loop. 'Sonnet' already means something to EEs (https://www.sonnetsoftware.com/).
Yeah same here. Thought Sonnet had added some ML stuff into their EM simulator.
Opus is better than GPT-4? I've heard mixed experiences.
That's because the sample size is probably small and for niche prompts or topics.

It's very hard to evaluate whether one model is better than another, and doing it in a scientifically sound way is especially time-consuming.

This is why I find comments like "model X is so much better than model Y" to be about as useful as "chocolate ice cream is so much better than vanilla."

And both flavors have a base flavor of excrement... Still, since I started using Claude 3 Opus (and now 3.5 Sonnet) a couple of months back, I don't see myself switching from them nor stopping use of LLM-based AI tech; it's just made me feel like the computer is actually working for and with me and even that alone can be enough to get me motivated and accomplish what I set out to do.
"it's just made me feel like the computer is actually working for and with me and even that alone can be enough to get me motivated and accomplish what I set out to do."

This is a great way to describe what I've been feeling / experiencing as well.

Just an update on my initial impressions of Claude 3.5 Sonnet. It's a better programmer than I am in Python; that's not saying much, but this is now two nights in a row I've been impressed with what I've created with it.
True, I just tried it for generating a book summary, and Sonnet 3.5 was very bad. GPT-4o is equally bad at that; gpt-4-turbo is great.
This more likely has to do with context length?
No, all the information is there, but gpt-4o tends to produce bullet points (https://www.thesummarist.net/summary/the-making-of-a-manager...), whereas gpt-4-turbo tends to produce much more readable prose (https://www.thesummarist.net/summary/supercommunicators/the-...).
How is prose more readable than bullets?
It really depends on the type of question, but generally I'm between Gemini and Claude these days for most things.
Opus 3.5 is not yet released.
I assume the GP was talking about 3.0.