Hacker News new | ask | show | jobs
by dokka 481 days ago
It's good at taking code of large, complex libraries and finding the most optimal way to glue them together. Also, I gave it the code of several open source MudBlazor components and got great examples of how they should be used together to build what I want. Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.
2 comments

> Sure, Grok 3 and Sonnet 3.7 can do that, but the GPT 4.5 answer was slightly better.

Sonnet 3.7: $3/million input tokens, $15/million output tokens [0]

GPT-4.5: $75/million input tokens, $150/million output tokens [1]

if it's 10-25x the cost, I would expect more than "slightly better"

0: https://www.anthropic.com/news/claude-3-7-sonnet

1: https://openai.com/api/pricing/

It really depends on how much it actually costs for a task though. 10x more of almost nothing isn't important.
there's a $1 widget and a slightly better $10 widget.

if you're only buying 1 widget, you're correct that the price difference doesn't matter a whole lot.

but if you're buying 10 widgets, the total cost of $10 vs $100 starts to matter a bit more.

say you run a factory that makes and sells whatchamacallits, and each whatchamacallit contains 3 widgets as sub-components. that line item on your bill of materials can either be $3, or $30. that's not an insignificant difference at all.

for one-off personal usage, as a toy or a hobby - "slightly better for 10x the price" isn't a huge deal, as you say. for business usage it's a complete non-starter.

if there was a cloud provider that was slightly better than AWS, for 10x the price, would you use it? would you build a company on top of it?

It's unfortunate it is named 4.5 -- it is next generation scale, and it's a 1.0 of next-generation scale.

Sonnet is on its 3rd iteration, i.e. has considerably more post-training, most notably, reasoning via reinforcement learning.

It's not really the beginning (1.0) of anything - more like the end given that OpenAI have said this'll be the last of their non-reasoning models - basically the last scale-up pre-training experiment.

As far as the version number, OpenAI's "Chief Research Officier" Mark Chen said, on Alex Kantrowitz's YouTube channel, that it "felt" like a 4.5 in terms of level of improvement over 4.0.

That's a lot of other stuff, and you express disagreement.

I'm sure we both agree it's the first model at this scale, hence the price.

> It's not really the beginning (1.0) of anything

It is a LLM w/o reasoning training.

Thus, the public decision to make 5.0 = 4.5 + reasoning.

> "more like the end...the last scale-up pre-training experiment."

It won't be the last scaled-up pre-training model.

I assume you mean, what I expect, and you go on to articulate: it'll be last scaled-up-pre-training-without-reasoning-training-too-relesed-publicly model.

As we observe, the value to benchmarks of, in your parlance, scaled-down pretraining, with reasoning training, is roughly the same as scaled-up pre-training without reasoning training.

> Yes it is. It's the first model at this scale.

Is it? Bigger than Grok 3? How do you know - just because it's expensive?

At some point, I have to say to myself: "I do know things."

I'm not even sure what the alternative theory would be: no one stepped up to dispute OpenAI's claim that it is, and X.ai is always eager to slap OpenAI around.

Let's say Grok is also a pretraining scale experiment. And they're scared to announce they're mogging OpenAI on inference cost because (some assertion X, which we give ourselves the charity of not having to state to make an argument).

What's your theory?

Steelmanning my guess: The price is high because OpenAI thinks they can drive people to Model A, 50x the cost of Model B.

Hmm...while publicly proclaiming, it's not worth it, even providing benchmarks that Model A gets the same scores 50x cheaper?

That doesn't seem reasonable.

Versions numbers for LLMs don't mean anything consistent. They don't even publicly announce at this point which models are built from new base models and which aren't. I'm pretty sure Claude 3.5 was a new set of base models since Claude 3.

What do mean by "it's a 1.0" and "3rd iteration"? I'm having trouble parsing those in context.

If Claude 3.5 was a base model*, 3.7 is a third iteration** of that model.

GPT-4.5 is a 1.0, or, the first iteration of that model.

* My thought process when writing: "When evaluating this, I should assume the least charitable position for GPT-4.5 having headroom. I should assume Claude 3.5 was a completely new model scale, and it was the same scale as GPT-4.5." (this is rather unlikely, can explain why I think that if you're interested)

** 3.5 is an iteration, 3.6 is an iteration, 3.7 is an iteration.

How do you feed them large code bases usually?
Use an AI-supporting editor like Cursor, or GitHub CoPilot, or perhaps Sonnet 3.7's GitHub integration.