| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nl 55 days ago

A Opus 4.7/Gpt5.5 class model is 5 trillion parameters[1].

To run a 8 bit quantized version of that you need roughly 5TB of RAM.

Today that is around 18 NVidia B300. That's around $900,000, without including the computers to run them in.

It's true that the capability of open source models is improving, but running actual frontier models on your MPB seems a way off.

[1] https://x.com/elonmusk/status/2042123561666855235?s=20 (and Elon has hired enough people out of those labs to have a fair idea)

9 comments

crazylogger 54 days ago

People had this "why you probably can't run a GPT-4 (or even GPT-3.5) class model on your MBP anytime soon" conversation before.

Today's LLMs are able pack much more capabilities into fewer parameters compared to 2023. We might still be at the very rudimentary phase of this technology there are low-hanging efficiency gains to be had left and right. These models consume many orders of magnitude more energy than a human brain, this all seems like room for improvement.

The right question: is there a law in information theory that fundamentally prevents a 70B model of any architecture from being as smart as Opus 4.7?

kvern 54 days ago

There is a huge gap between "in two years" and "theoretically possible"

hnben 54 days ago

>> People had this "why you probably can't run a GPT-4 (or even GPT-3.5) class model on your MBP anytime soon" conversation before.

Difwif 54 days ago

The OP said "as capable as the frontier cloud models are today" which might assume model improvements that do more with less. Opus 4.7/Gpt5.5 performance might be achievable with a fraction of the parameters.

spflueger 54 days ago

Exactly. I also feel like being able to choose a model for the use case could be worth an idea. So instead of trying to squeeze all kinds of knowledge into a single model, even if it's moe, just focus models on use cases. I bet you only need double digit billion parameter models for that with same or even better performance

ako 54 days ago

Opus and Gpt are generic LLMs with knowledge on all sort of topics. For specific use cases you probably don't need all the parameters? Suppose you want to generate code with opencode, what part of the generic LLM is needed and what parts can be removed?

byzantinegene 54 days ago

we're already doing that, it's called distillation and how models like deepseek are trained.

Nimitz14 54 days ago

I think your own math leads to the conclusion the public apis are not serving models of that size. They couldn’t afford to

hedgehog 55 days ago

As far as I can tell Minimax M2.7 is better than anything available a year ago, but it runs on an ordinary PC. Will that continue? Not sure, but the trend has continued for the last two years and I don't know of any fundamental limits the models are approaching.

ricardobayes 54 days ago

I wish more people were more aware of this. I think so much of the current optimism is based on "it doesn't matter if companies are raising prices since I'm just going to run the model locally", doesn't fly.

otabdeveloper4 54 days ago

> A Opus 4.7/Gpt5.5 class model is 5 trillion parameters.

Or so they say.

If it's true then that just shows how far behind the cloud providers are lagging while wasting investor money.

(There's a huge amount of diminishing returns in increasing parameter counts and the intelligent AI company should be hard at work figuring out the optimal count without overfitting.)

rurban 54 days ago

Do that will only be possible with something like better 3D NAND flash memory, needs a new hardware. People are already trying to bring that the market. Contemplated taking a compiler position in such a company.

zozbot234 54 days ago

HBF is a non-starter, it runs way too hot compared to DRAM (which only pays for refresh at idle) for the same memory traffic. Only helps for extremely sparse MoE models - probably sparser than we're seeing today.

zozbot234 54 days ago

> A Opus 4.7/Gpt5.5 class model is 5 trillion parameters[1].

You could run it on a cluster of nodes that each do some mix of fetching parameters from disk and caching them in RAM. Use pipeline parallelism to minimize network bandwidth requirements given the huge size. Then time to first token may be a bit slow, but sustained inference should achieve enough throughput for a single user. That's a costly setup of course, but it doesn't cost $900k.

> You could run it on a cluster of nodes

Not sure this is a MBP either.

bigyabai 54 days ago

Not even a cluster of Mac Pros could run a dense 5T parameter model with RDMA, to my knowledge.

zozbot234 54 days ago

SOTA models are reportedly MoE, not dense.

bigyabai 54 days ago

A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.

zozbot234 54 days ago

True but a cluster built on pipeline parallelism can naturally stream from multiple SSD's in parallel. That probably makes offload somewhat more effective. And you also have RAM caching available as a natural possibility.