Hacker News new | ask | show | jobs
by valine 475 days ago
Probably helps that models like deepseek are mixture of expert. Having all weights in VRAM means you don’t have to unlod/reload. Memory bandwidth usage should be limited to the 37B active parameters.
1 comments

> Probably helps that models like deepseek are mixture of expert. Having all weights in VRAM means you don’t have to unlod/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive above quote. I understand having the entire model loaded into RAM helps with response times. However, I don't quite understand the memory bandwidth to active parameters.

Context window?

How much the model can actively be processed despite being fully loaded into memory based on memory bandwidth?

With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.
What people who did not actually work with this stuff in practice don't realize is the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size of greater than 1 (speculative decoding), could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT3 I had a conversation with someone where we tried to predict how long it would be before GPT3 could run on a home desktop. This mac studio that has enough VRAM to run the full 175B parameter GPT3 with 16bit precision, and I think that’s pretty cool.

Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.

Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.
Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE
Computers don't usually depreciate slowly
Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.
This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.
I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.
No one who is using this for home use cares about anything except batch size 1 sequence size 1.
What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.
People want to talk to their computer, not service requests for a thousand users.
For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).

Anything in between suffers.

Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.
Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.