
by FloatArtifact 475 days ago

> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the quote above? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand how memory bandwidth relates to the active parameters.

Context window?

Or is it about how much of the model can actively be processed at once, given the memory bandwidth, even though it is fully loaded into memory?


valine 475 days ago

With a mixture-of-experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well, which reduces the size of the tensors you write to memory.
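
As a rough back-of-envelope sketch of that point: if single-stream decoding is bandwidth-bound, the ceiling on tokens per second is roughly memory bandwidth divided by the bytes of weights read per token. The 37B active figure is from the quote; the 671B total, the 4-bit weights, and the 819 GB/s bandwidth are assumptions for illustration, and the estimate ignores KV-cache reads, activations, and compute.

```python
def decode_tokens_per_second(params_read_per_token: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when weight reads are the only bottleneck."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Assumed figures for illustration only (not measurements).
BANDWIDTH = 819e9       # bytes/s, assumed unified-memory bandwidth
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantization
ACTIVE_PARAMS = 37e9    # active parameters per token, from the quote
TOTAL_PARAMS = 671e9    # assumed total parameter count

moe_ceiling = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
dense_ceiling = decode_tokens_per_second(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

print(f"MoE ceiling (active params only): {moe_ceiling:.1f} tokens/s")
print(f"Dense ceiling (all params read):  {dense_ceiling:.1f} tokens/s")
```

Under these assumptions the ratio between the two ceilings is just total/active parameters, which is why the "37B active" number matters more for decode speed than the full model size.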

ein0p 475 days ago

What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, a batch size greater than 1 (more than one inference at a time) or a sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
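
To put a rough number on the batch-size point, here is a minimal sketch that assumes routing is uniform and independent across tokens (real routers are not, so treat this as an illustration only). The expert counts (256 routed experts, 8 chosen per token) are assumed values, not taken from this thread. Under that model, the expected fraction of a layer's experts that must be read for a batch of tokens is 1 - (1 - k/E)^B.

```python
def expected_fraction_of_experts_read(num_experts: int,
                                      experts_per_token: int,
                                      num_tokens: int) -> float:
    """Expected fraction of a layer's experts touched by a batch of tokens.

    Assumes each token independently picks `experts_per_token` distinct
    experts uniformly at random, so a given expert is missed by one token
    with probability 1 - k/E.
    """
    p_missed_by_one_token = 1.0 - experts_per_token / num_experts
    return 1.0 - p_missed_by_one_token ** num_tokens


# Hypothetical configuration for illustration.
NUM_EXPERTS = 256
EXPERTS_PER_TOKEN = 8

for batch_tokens in (1, 4, 16, 64, 256):
    frac = expected_fraction_of_experts_read(NUM_EXPERTS, EXPERTS_PER_TOKEN, batch_tokens)
    print(f"tokens in flight = {batch_tokens:4d} -> ~{frac:.0%} of experts read per layer")
```

With one token in flight only k/E of the experts are read, but the fraction climbs quickly toward the whole layer as the number of tokens processed together grows, which is the nuance described above.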

valine 475 days ago

No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT-3 I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that’s pretty cool.
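
The arithmetic behind that claim, as a quick sketch. The 175B parameter count and 16-bit precision are from the comment; the 512 GB memory capacity is an assumed figure for a maxed-out configuration, and the calculation counts weights only (no KV cache, activations, or OS overhead).

```python
def weights_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Size of the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


GPT3_PARAMS = 175e9           # from the comment
ASSUMED_CAPACITY_GB = 512.0   # assumed unified-memory capacity, illustration only

fp16_gb = weights_gigabytes(GPT3_PARAMS, 16)
fits = fp16_gb < ASSUMED_CAPACITY_GB

print(f"175B params at 16-bit: {fp16_gb:.0f} GB of weights")
print(f"Fits in {ASSUMED_CAPACITY_GB:.0f} GB (weights only): {fits}")
```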

doctorpangloss 475 days ago

Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.

jonfromsf 475 days ago

Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.

fennecfoxy 471 days ago

Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE

hot_gril 475 days ago

Computers don't usually depreciate slowly

km3r 475 days ago

Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.

ein0p 475 days ago

Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.

DevKoala 475 days ago

This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.

ein0p 475 days ago

I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.

DevKoala 475 days ago

I hear you. I make these decisions for a public company.

When engineers tell me they want to run models on the cloud, I tell them they are free to play with it, but that isn’t a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of token/dollar thanks to economies of scale.

There is still value in running your own models for privacy reasons, however, and that’s why I pay attention to efforts to reduce the cost of running models locally or in your cloud provider.

Der_Einzige 475 days ago

No one who is using this for home use cares about anything except batch size 1 sequence size 1.

ein0p 475 days ago

What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.

saagarjha 475 days ago

People want to talk to their computer, not service requests for a thousand users.

rfoo 475 days ago

For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (use expert parallelism, EP, to efficiently serve a large number of users).

Anything in between suffers.

bick_nyers 475 days ago

Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.

valine 475 days ago

Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.
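
A rough sketch of why it matters where the experts live: if some fraction of the weights needed for a token is not in fast memory, those bytes have to come over a slower link first. The split into a "fast" and a "slow" tier, the bandwidth numbers, and the miss fractions below are all hypothetical values for illustration, and the model ignores any overlap between transfer and compute.

```python
def seconds_per_token(active_bytes: float,
                      miss_fraction: float,
                      fast_bw_bytes_per_s: float,
                      slow_bw_bytes_per_s: float) -> float:
    """Time to read one token's active weights when some live in a slow tier."""
    miss_bytes = active_bytes * miss_fraction
    hit_bytes = active_bytes - miss_bytes
    return hit_bytes / fast_bw_bytes_per_s + miss_bytes / slow_bw_bytes_per_s


# Hypothetical figures for illustration only.
ACTIVE_BYTES = 37e9 * 0.5   # 37B active params at an assumed 4 bits each
FAST_BW = 819e9             # assumed fast-memory bandwidth, bytes/s
SLOW_BW = 7e9               # assumed SSD read bandwidth, bytes/s

for miss in (0.0, 0.1, 0.5, 1.0):
    t = seconds_per_token(ACTIVE_BYTES, miss, FAST_BW, SLOW_BW)
    print(f"miss fraction {miss:.1f}: {t:.3f} s/token (~{1 / t:.1f} tokens/s)")
```

Even a small miss fraction dominates the per-token time when the slow tier is two orders of magnitude slower, which is consistent with both points above: keeping everything in fast memory helps a lot, and a faster fallback tier hurts less than disk.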