by blueblimp 841 days ago
I wonder what's happening to all that money. Back when they originally released Claude, they were second only to OpenAI as far as chatbot models were concerned. Although Claude wasn't as smart as GPT-4, it had a more pleasant writing style, and Anthropic later released 100k context. At the time, I expected Anthropic to be the next company to release a GPT-4-level model.

But since then Claude has been passed by Mistral's mistral-medium and Google's Gemini Ultra. More concerningly for Anthropic, each subsequent release of Claude has actually performed _worse_ on the Chatbot Arena Leaderboard. (Claude-1 outranks Claude-2.0, which outranks Claude-2.1.) The reason for the decline in ranking is seemingly that the most noticeable update is to make the model refuse more requests.

In an additional blow, the needle-in-a-haystack independent benchmark revealed that Claude's long context is not actually used effectively by the model.

All-in-all, Anthropic is not looking in a good spot, despite the massive investment. They need to start releasing legitimately better models, or risk irrelevance.
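For anyone unfamiliar with the needle-in-a-haystack test mentioned above, the basic idea can be sketched in a few lines: bury one fact at a chosen depth inside a long filler text, ask the model about it, and check whether the answer contains the fact. This is only a rough sketch of the idea, not the actual benchmark's code; the function names, filler text, and needle are all made up for illustration, and the call to a model is left out entirely.

```python
def build_haystack_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def retrieved(answer, expected):
    """Crude scoring: does the model's answer contain the expected fact?"""
    return expected.lower() in answer.lower()


# Hypothetical filler and needle, invented for this example.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_haystack_prompt(filler, needle, depth=0.5)
```

A real run would repeat this over many context lengths and depths and plot where retrieval fails.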

7 comments

A bit off-topic, but it's very cool how competitive this market is. It's been barely a year and we're all expecting every company to do better within a span of months. Even in the heyday of cloud hosting investment, people weren't expecting updates from each company on a weekly basis.

Part of me really wants to get into the game somehow, as it looks very invigorating and motivating from the outside. I'm not entirely sure where I would need to start, though, as I'm not at some cutting-edge AI/ML company.

We do need to cut them some slack; no other category has ever moved this fast.
For getting a flavor of training LLMs without needing to be at one of the pre-training companies, it's very accessible to fine-tune a relatively small open source LLM such as Mistral 7B. (There are many tutorials.)
I think there's a lot of funny business going on with accounting, misinformation, and hype generation. I'm kinda tangentially involved (I work at one of the players on projects that use LLMs, but don't train them), but wouldn't want to get directly involved in the core areas, because I suspect that by the time I could develop the skillset the hype cycle will end and the bubble will burst.

It could be an attractive place to be after the bubble bursts - there are some real technological developments that have been made, just not nearly as revolutionary or widely applicable as have been claimed. But AI has a history of 10-20 year hype cycles, so it could be a while before it gets hot again afterwards.

In a gold rush, sell shovels.
At the MIT event, Altman was asked if training GPT-4 cost $100 million; he replied, “It's more than that.”

Training costs are decreasing, but whenever there's an update to GPT-4 (e.g. training data cutoff updated to December 2023), it means the model has been retrained. The compute costs of training a model like Claude 2 are significant.

Also, keep in mind that not every trained model becomes ready for production. Some are discarded, similar to how you might burn a few cookies while baking.

I don't think they always retrain models from scratch. Sometimes they might do continual learning (take the old model and train it on newer data)
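As a toy illustration of the difference between retraining from scratch and continuing from the old weights, here is a minimal sketch using a one-parameter model. Everything in it (the data, the learning rate, the step counts) is an assumption chosen for illustration; it says nothing about how any lab actually updates its models.

```python
def train(w, data, steps, lr=0.01):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


old_data = [(x, 2.0 * x) for x in range(1, 6)]  # older data: y = 2.0 * x
new_data = [(x, 2.2 * x) for x in range(1, 6)]  # newer data: the relationship has drifted

w_old = train(0.0, old_data, steps=200)        # the long, expensive original run
w_continued = train(w_old, new_data, steps=5)  # continue from the old weights
w_scratch = train(0.0, new_data, steps=5)      # same small budget, starting from zero

# Distance from the new target weight (2.2) after the same 5-step budget.
error_continued = abs(w_continued - 2.2)
error_scratch = abs(w_scratch - 2.2)
```

With the same small budget, the continued run ends up much closer to the new target than the from-scratch run, which is the appeal of not starting over.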
Seems like the Occam's razor hypothesis would be "company that split from OpenAI over safety concerns is overly focused on safety at the expense of performance."
That's not Occam's razor, that's reifying personal biases.
Claude doesn't refuse requests if you have API access and use a prefill, for what it's worth. If anything it actually complies more than GPT 4.

When you say Claude is worse than Mistral Medium are you going by the Chatbot Arena Leaderboard or some other benchmark? I wouldn't use Claude for coding but I find it to be pretty good at other tasks.

> When you say Claude is worse than Mistral Medium are you going by the Chatbot Arena Leaderboard or some other benchmark?

I'm mainly going by the arena leaderboard, but it's also true in my limited experience with the two. (I mainly use either GPT-4 or open models.) And it's the only model I can remember getting an ethical refusal from. (I don't push hard in that aspect, so it was surprising.) I know it can be jailbroken, but, precisely because I don't push the models hard, I'm not skilled at jailbreaking.

By the way, the mention of API access reminded me how weird it is that the Claude API is still application-only, unlike OpenAI, Google, and Mistral.

I wonder if it's a situation where the knowledge required to improve these models at a fundamental level (as opposed to just adding more data for testing or screwing around with system prompts) is so deep and rare that a single person leaving an organization can have a gigantic impact on the overall trajectory of a startup. No amount of money will fix that problem if the skills required are that rare. Established large companies like Google, Facebook, etc. have a much deeper bench of specialists than a startup, I would assume, and so can survive some people jumping ship.
I doubt it. A lot of this is like voodoo and throwing darts at a board.

Some people developed some intuition for it, but it's, well, random. We're evolving big things we don't really understand, and for reasons we don't really understand, some things evolve better than others. At some point, people started getting feelings in their gut that some "architectures" worked better for vision than NLP, and things diverged. There continues to be queasy-gut intuition, but when you read all the papers, if you cut through all the hairy math language, it's a lot of voodoo and throwing darts at a board.

I think it's much more the same problem you'd have at any shop, where the smartest people want to work for FAANG / sexy startups / elite universities, etc., and the bottom end of the market just wants a job, and lands with lower-tier employers. Talent naturally consolidates. At the same time, org culture, strategy, and leadership can have a huge impact too.

They want to tank the company to the point where another big tech company buys it out. And since they focus so much on "safety", that buyer could be Apple or some other conservative company.
I've been using Claude recently and, as a product, regardless of the benchmarks, it feels much nicer to use. I haven't spent a lot of time playing with open models, though.
I just tried Claude and it gives good, concise responses to software questions. I don't like AI services from Microsoft or Google because I have been burned by them more than once when they changed the terms of their services or killed a whole service just because. Not that OpenAI or Anthropic can't do the same, but at least I give them the benefit of the doubt.
OpenAI does the same, indeed definitely much more often than Microsoft and probably more often than even Google. In fairness, it's all tagged beta. However:

1) Models change. Things I did a year or two ago behave differently, and usually stupider.

2) OpenAI just announced they're killing GPT-3 and other older models. It's annoying, since I like GPT-3. It doesn't have "safety" built in, which makes it better at a lot of creative work. If you have other models baked into your codebase, you're looking at a massive migration.

Extra credit: Find any public information from OpenAI about this. See how long it takes you. (in fairness, customers did receive an email)

3) OpenAI decided to kill the completion API. This is annoying, since a lot of my uses don't look like a chatbot. I understand chatbot is THE killer app that's come up, but a lot of stuff works better as, well, completion. Indeed, something broke in the past month since I am getting

Extra credit: Same as above.

A normal company would continue to include this stuff on e.g. their pricing page marked "deprecated," have an announcement, etc. Here, it's like GPT-3 never existed. And I really miss the old soulful models.

There's a major move towards open-source models in my industry since a lot of things built a year ago on OpenAI no longer work or at least no longer work the same.

Critically, you can't validate apps built on cloud-based services since you have no idea when models change and your app might suddenly do something dumb.
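One partial mitigation is to record known-good responses for a fixed set of prompts and re-check them on a schedule, so that a silent model change at least shows up as a failed check. The sketch below is only that, a sketch: `fake_model` stands in for a real hosted-model call, the prompts are invented, and exact-match hashing is deliberately strict (a real check would probably need a fuzzier comparison, since hosted models are often not fully deterministic).

```python
import hashlib


def fingerprint(text):
    """Stable hash of a response after normalizing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(call_model, golden):
    """Return the prompts whose current response no longer matches the recording."""
    return [
        prompt
        for prompt, expected in golden.items()
        if fingerprint(call_model(prompt)) != expected
    ]


# Hypothetical stand-in for a real API call to a hosted model.
def fake_model(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]


# Record fingerprints once, while the app is known to behave correctly.
golden = {p: fingerprint(fake_model(p)) for p in ["2+2?", "Capital of France?"]}

# Later: re-run the same prompts and list any that changed.
drifted = find_drift(fake_model, golden)
```

This does not make a cloud-hosted app validatable in any strict sense; it only shortens the time between a model changing and someone noticing.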