Hacker News
by netsec_burn 728 days ago
Opus remained better than GPT for me, even after the release of GPT-4o. VERY happy to see an even further improvement beyond that. Claude is a terrific product, and given the news that GPT-5 only began its training several weeks ago, I don't see any situation where Anthropic is dethroned in the near term. There are only two parts of Anthropic's offering I'm not a fan of:

- Lack of conversation sharing: I had a conversation with Claude where I asked it to reverse engineer some assembly code and it did it perfectly on the first try. I was stunned, GPT had failed for days. I wanted to share the conversation with others, but there's no sharing feature like the one GPT has, and no way to even print the conversation because it cuts off on the browser (tested on Firefox).

- No Android app. They're working on this but for now, there's only an iOS app. No expected ETA shared, I've been on the waitlist.

I feel like both of these are relatively basic feature requests for a company of Anthropic's size, yet it has been months with no solution in sight. I love the models, please give me a better way of accessing them.

13 comments

Both GPT-4 and 4o have been completely useless for coding in the past couple of weeks for me - constant errors, and not just your typical LLM inaccuracies but incapable of producing a few lines of self-consistent code e.g. defines variables foo on one line and refers to it as bar on the next, or it misspells it as foox.
What language? Because I'm guessing they work well for languages with a large amount of training data like Python (in my experience), less well for less used languages like Zig or Clojure (haven't tried them but that's my theory)
From my experience, GPT-4 works well with both Clojure and Zig. A lot of it depends on the way you prompt though. For example, asking to start with a C or C++ example and converting to Zig often works better than starting straight with Zig. The same strategy works with Java and Clojure too.
I use it for Rust and it's.... meh. It gets things wrong enough that I don't reach for it except to help me reference certain docs. It tends to hallucinate APIs and semantics that just don't exist. Honestly couldn't imagine using it with a dynamic language.
Python here. And like they said, only noticeable in the last few weeks.
I've been seeing this too. Always hard to tell what's a real change vs the rolls of the dice lately but I've been having weird python inconsistencies too, in very short snippets doing pretty simple things.
For me it has been very repetitious despite my instruction to the contrary.
I've been experiencing bizarre typos and misspellings that I've come to describe as the model being drunk. Things like it writing peremeter instead of parameter
Yeah, misspellings were something so rare that I thought an LLM was incapable of producing them.

Yet over the past few weeks GPT-4 and 4o make them all the time. It will randomly change my postgres schema from public to publish. And, well, just see this one for yourself:

> *Using the 'kubectl cp Command*: Execute the 'czygk cp' command to copy the file from your local machine to the pod.

Today, I asked 4o how to get around conditionally executing React hooks (illegal in React), and it rewrote my code to simply do it again, merely swapping the order of a ternary. Performance is possibly worse than GPT-3.

Maybe they’re weakening it because they expanded their free tier, but it has become surprisingly bad.

The level of misspelling is insane at the moment. It happens in roughly half of the responses or more. I just started using Claude 3.5 and the difference is night and day.
It's the same model though. Maybe your perception has changed.
I first noticed logprob fluctuations in GPT-4o. Perhaps the same phenomenon is also going on with Turbo. I don't recall specifics, but it was inconsistencies in variable names, meaning the same variable name got a typo somewhere, but the typo was close enough: perhaps a space vs. an underscore or something like that.

The model could be the same, but maybe something in the infra is different.

I can’t speak for what OpenAI is doing, but I’ve noticed those types of hallucinations occurring when I quantize a model beyond a certain point.

Maybe they are trying to cut down on memory usage?
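
For anyone curious why heavy quantization degrades output, here is a toy sketch in Python. It is not how any provider serves models, and it says nothing about what OpenAI actually does. It only shows that the rounding error of a simple symmetric uniform quantizer grows quickly as the bit width shrinks. The function names and the random "weights" are made up for illustration.

```python
import random

def quantize(weights, bits):
    """Round weights to a signed grid of the given bit width, then map back to floats."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and quantized weights."""
    approx = quantize(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, approx)) / len(weights)

random.seed(0)
# Hypothetical stand-in for a layer's weights: 10,000 Gaussian samples.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean abs error {err:.4f}")
```

Real quantization schemes (per-channel scales, outlier handling and so on) are far more careful than this, but the trade-off between memory and precision is the same idea.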

Is it the same? On the Models page of the API docs it says that GPT-4 is using the June 13th which would be different than the March 23rd.
> I had a conversation with Claude where I asked it to reverse engineer some assembly code and it did it perfectly on the first try. I was stunned

I share the same experience with you but with Claude 3 Sonnet. I can’t count how many times I’ve shared some code with Claude with barely any hope because other GPTs had failed as well, and yet Claude surprised me and performed the task successfully.

I’ve actually reached the point where I expressed my gratitude to Claude because of how well it performs on coding tasks and other tasks in general. I don’t know what Anthropic did, but they did something right.

Being able to handle large amounts of tokens, “understand” and perform tasks on it & spit out large amounts of data back with barely any cut-offs (unlike Gemini) has made me feel like Claude is at the moment the best option.

I do wonder if GPT quality fluctuates seasonally, or with electricity costs, in an engineering effort to balance costs with performance.

I agree on all your points, but would like to emphasize that I really do enjoy the voice input and voice output feature that ChatGPT's app has. It's not how I use it when working, but when commuting, a lot of times, I'll turn on the ChatGPT app and have a conversation with it exploring ideas related to work or side projects. It's better than NPR, and I can't listen to the '3d6 Down the Line' podcast every day, just once a week.

I've been subscribed to Phind, which is a decent service allowing access to their own models, ChatGPT 4 Turbo and 4o, and the Claude models. It's been incredibly useful, especially with their search integration. Unfortunately, while ChatGPT can be used 500 times a day, Claude is limited to 10, although I guess it goes into an API-like payment mode after that on top of the subscription.

I sure wish I'd buckle down and calculate my usage to really get an idea of whether subscription is cheaper or more expensive for me compared to API.
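
A rough sketch of that calculation in Python, in case it helps. Every number below is a placeholder, not a real price: plug in the per-token rates and subscription fee from whichever provider you use. The function names are mine.

```python
def api_cost_per_month(requests_per_day, tokens_in, tokens_out,
                       price_in_per_mtok, price_out_per_mtok, days=30):
    """Estimated monthly API spend, given average tokens per request
    and prices quoted per million tokens."""
    per_request = (tokens_in * price_in_per_mtok
                   + tokens_out * price_out_per_mtok) / 1_000_000
    return requests_per_day * days * per_request

def break_even_requests_per_day(subscription_price, tokens_in, tokens_out,
                                price_in_per_mtok, price_out_per_mtok, days=30):
    """Requests per day at which API spend equals the subscription fee."""
    one_per_day = api_cost_per_month(1, tokens_in, tokens_out,
                                     price_in_per_mtok, price_out_per_mtok, days)
    return subscription_price / one_per_day

# Placeholder figures: 20 requests/day, 1000 input and 500 output tokens each,
# $3 and $15 per million tokens, against a $20/month subscription.
monthly = api_cost_per_month(20, 1000, 500, 3.0, 15.0)
threshold = break_even_requests_per_day(20.0, 1000, 500, 3.0, 15.0)
print(f"API: ${monthly:.2f}/month, break-even at {threshold:.1f} requests/day")
```

Below the break-even number the API is cheaper; above it the subscription wins, assuming the subscription's rate limits don't get in the way first.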

Short of switching between models (which at least OpenAI definitely does for free customers, but I believe they always indicate it), how would that work? Different quantizations?
caught me speculating. I suppose some mild quanting and/or prompt injection to keep responses smaller unless specifically asked: e.g. use ...
> Lack of conversation sharing... [there is] no way to even print the conversation because it cuts off on the browser (tested on Firefox).

Until they make conversations shareable, in the meantime you can print the whole page in Chrome by:

- going to Developer Tools (Ctrl + Shift + I)

- opening the Command Palette (Ctrl + Shift + P)

- searching for 'screenshot'

- selecting Capture full size screenshot

I recently released Slackrock [https://github.com/coreylane/slackrock] that you may find helpful, it's a Slack chat app that can access several FMs (including Claude 3.5) via AWS Bedrock. Responses can be easily shared with others by inviting them to your channels, and Slack has an Android app. It doesn't support attachments (yet) but I'm working on it!
cool!
> Lack of conversation sharing

You can use my product https://ChatHub.gg which supports dozens of chatbots including Claude and can share conversations from any of them.

If you have an API key, using Opus with a 3rd party UI like typingmind.com solves all of the problems you mentioned (disclaimer: I'm the app developer)
I use LibreChat for this as self hosted UI. Works awesome.
I'm sticking w/ Claude for the foreseeable future as they seem less slimy than OpenAI/Microsoft/Google so far and care about safety.

I'm in the same boat waiting for an Android app btw. One other feature that I'm hoping they catch up to others on is a permanent context window so that I can get Claude to stop speaking so formally all the time

To each their own, but I still prefer ChatGPT. The UI for Claude is terrible in my opinion.

I had subscriptions for both and I would fire off questions to both of them and see which one I liked more and I consistently liked the ChatGPT ones more. I canceled my subscription last week for Claude. I am super happy that Anthropic continues to push the envelope on this and I hope to re-subscribe to them in the future.

If it's really only the UI that's bothering you, why not use a web UI such as Open WebUI?
The UI wasn’t the only issue, but I will look into that.
> GPT-5 only began its training several weeks ago

Source?

https://openai.com/index/openai-board-forms-safety-and-secur... (May 28th)

> OpenAI has recently begun training its next frontier model and we anticipate the resulting systems to bring us to the next level of capabilities on our path to AGI.

No doubt openai have been training big models for the last year. If “gpt5” is only just starting it means recent training runs have had disappointing results and have been passed off as “Gpt4o” or whatever.

The value of all the AI companies is predicated on high chance of AGI, and gpt5 failing to be revolutionary may pop the whole bubble (+10 trillion of market cap)

Sam said on Lex's podcast that people should temper their expectations for GPT-5, not in that it will necessarily suck, but that they want to ramp up ability slowly over time rather than discrete large steps.
Sounds like an excuse tbh. Esp when other companies are pushing ahead beyond OAI and open source is close to rivaling them
Yeah. Sam wants to productize $$$ what they have now rather than sink time and money training future models with uncertain outcomes. I suspect that difference in focus is what Ilya Sutskever means by wanting to “advance capabilities as fast as possible” in the Safe Superintelligence Inc. announcement.
> If “gpt5” is only just starting it means recent training runs have had disappointing results and have been passed off as “Gpt4o” or whatever.

Sora probably took a lot of cluster time don't you think?

Based on other things they said in the last couple of months, it looks like GPT-4.5 is coming this summer, and then GPT-5 in the Fall.
I've had way better success with GPT-4o than claude. I wonder why
Have you tried 3 Opus or 3.5 Sonnet? Are you using it for programming, or something else?
everything really. just opus so far
Personal prompting style, I imagine.
People really, really, underestimate how important prompting is.

I would be confident in stating that half the people who complain about a model are actually just suffering from poor prompting.

And what makes you so confident that all those people are using different prompt styles when comparing models? You think most people don’t even understand the bare basics of how to compare two products?
That's the point: maybe someone has a personal prompting style that works great with Claude but gives worse results with GPT-4.

They might complain that GPT-4 is rubbish in comparison to Claude, but someone with a different personal prompting style might experience the opposite.

Having a prompting style that works with one model but not quite with another is very different from the "suffering from poor prompting" that the previous person was accusing others of.

And given that those are tools, it's more like "the model can work with the user's prompts" rather than "the user's prompts are adapted to the model".

Unless we're here for an ego trip.

Ah, I see. I’d be interested to see a study on that. I find it hard to believe it would make such a stark difference but it’s possible.
are non-snake oil prompting techniques described anywhere?
Those are hard to come by, but the Anthropic prompting documentation is a pretty great source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
On the plus side, at least ChatBoost supports both openai and claude API. But for this specific model it seems to be broken... I hope that gets noticed and fixed soon.
What I understand is that it's GPT 6 that just went into training, and that GPT 5 is complete and being delayed until after the U.S. election.
And after GPT-5's release, what would be the plan for subsequent elections? This seems to be a temporary play in delaying AI regulation if public sentiment further becomes that AI can have a strong influence in the elections.
It’s absolutely temporary, but 4 years feels like an eternity in this field and I’m sure the major players would love to have that much time to entrench themselves before they have to battle “AI ban” legislation.
GPT-5 will make elections obsolete :)
Roko would be proud of you. I welcome our new electric masters.
Managed democracy offers absolute freedom; freedom from the burden of choice
Is there any online confirmation of this that's more than speculation?
No there is not
(assuming you are correct) It says something about how a company feels about the safety of their products when they feel like they should time the releases based on political events.
This is speculation because I don’t think any of the key players ever explicitly stated this is their strategy, but this year it feels like there’s some significant foot dragging on things like Sora and GPT-5. The big AI players really don’t want AI to become an election year punching bag and don’t want any major campaign promises around AI to placate a spooked electorate. And they really don’t want it to be revealed that generative AI powered bot armies outnumber real human political discourse 10-1. And they absolutely do not want an AI generated hoax video to have a measurable effect on the polls.

It’s a stopgap. If we get through this election without a major public freak out, it gives the industry 4 more years to take LLMs out to the point of diminishing returns and figure out safety before we get knee jerk regulation.

This is pure speculation, right?
Here's something that talks about it. I can't speak for the legitimacy, but I'm not pulling it out of my ass. They may be pulling it out of theirs. :-)

https://lifearchitect.ai/gpt-6/

I've listened to so many interviews that I couldn't tell you who said what at this point, but that is what I understood from somewhere. So, sure, take it as speculation.
Source: trust me bro
I also believe that gpt-4o was originally called gpt-5. If you look at the image generation on their website from gpt-4o which has not been released, I believe that along with the voice caused Ilya to declare mission accomplished (AGI) and that is why there was a coup. The coup failed because no one wanted to wrap up the company or change the way it operated because they would lose a lot of money.

The reason the name was changed was because there was a big public scare about gpt-5 taking over and so Altman had to promise not to release gpt-5 soon. So they changed the name to gpt-4o (omni). Which is A) obviously dramatically a different architecture, B) a huge step up in capabilities (most still unreleased) C) very general purpose. Because of A) and B), this should obviously be a new major version (5).

Yes, this is speculation, but it's very obvious speculation to me. It's weird for me that most people not only don't share this view but seem to absolutely hate when I say it.

I don't hate this speculation, I just don't buy it at all. 4o's about the same in terms of reasoning as 4. People don't find the text abilities that much more usable over 4 (at least on the LMSYS leaderboard). It's faster and has audio2audio capabilities alongside new native image stuff I think, but how exactly is that AGI if 4 isn't? These models' understanding and reasoning ability is still far too weak to cause any serious economic shifts yet.
Scroll to Explorations of Capabilities: https://openai.com/index/hello-gpt-4o/

That combined with the voice was probably considered AGI by Ilya.

Yes, I've seen this. Read my comment.
It's speculation with no basis at all. OAI has a track record of releasing half-step models, and 4o is no different, just like 3 to 3.5 and the numerous subsequent 3.5 releases.

If you've used 4 and 4o they are too similar for 4o to have been trained from scratch