by ipieter 378 days ago
This is an interesting blogpost. While the general conclusion ("We need batching") is true, inference of mixture of experts (MoE) models is actually a bit more nuanced.

The main reason we want big batches is that LLM inference is not limited by compute, but by loading every single weight out of VRAM. Just compare the TFLOPS of an H100 with its memory bandwidth: there's basically room for 300 FLOP per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This limit is often referred to as the "roofline model".
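
To make the roofline arithmetic concrete, here is a minimal sketch. The peak numbers are approximate H100 SXM figures that I am assuming (roughly 989 TFLOPS dense BF16 and 3.35 TB/s of HBM bandwidth), and `decode_intensity` uses a simplified model where each weight is loaded once per step and used for one multiply-add per sequence in the batch.

```python
# Assumed, approximate H100 SXM figures (not taken from the comment above).
PEAK_FLOPS = 989e12       # dense BF16 FLOP per second
MEM_BANDWIDTH = 3.35e12   # HBM bytes per second


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """FLOP per byte at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth


def decode_intensity(batch_size, bytes_per_param=2):
    """Simplified arithmetic intensity of one decode step.

    Each weight is loaded once and used for one multiply-add (2 FLOP)
    per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.0f} FLOP/byte")
    for batch in (1, 32, 512):
        flops = attainable_flops(decode_intensity(batch))
        print(f"batch {batch}: {flops / PEAK_FLOPS:.1%} of peak")
```

Under these assumptions a batch of 1 gets about 1 FLOP per byte loaded, so the GPU only approaches its compute peak once the batch reaches a few hundred sequences.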

As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory and you need to distribute them across GPUs or across nodes. Even with NVLink and InfiniBand, these communications are slower than loading from VRAM. NVLink is still fine for tensor parallelism, but across nodes this is quite slow.

So what MoE allows is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and have enough overhead for other stuff (KV cache, other weights, etc). So naturally the possible batch size becomes quite large. And of course you want to maximize this to make sure all GPUs are actually working.
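
A rough way to see why the batch has to get large: each token only visits a few experts, so the batch is diluted across all of them. The sketch below assumes perfectly uniform routing, and the expert count and top-k value in the usage example are illustrative assumptions, not figures for any specific model.

```python
import math


def tokens_per_expert(batch_tokens, num_experts, top_k):
    """Average tokens each expert sees per step, assuming uniform routing."""
    return batch_tokens * top_k / num_experts


def batch_for_target(target_tokens_per_expert, num_experts, top_k):
    """Batch size (in tokens) needed so each expert sees the target on average."""
    return math.ceil(target_tokens_per_expert * num_experts / top_k)


if __name__ == "__main__":
    # Hypothetical MoE layer: 256 experts, 8 active per token.
    print(tokens_per_expert(1024, 256, 8))
    print(batch_for_target(300, 256, 8))
```

With these made-up numbers, a batch of 1024 tokens leaves each expert with only 32 tokens on average, and getting every expert to roughly 300 tokens takes a batch of about 9600.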

4 comments

You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows)
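
A minimal sketch of that queueing idea, with hypothetical names throughout (`RoundRobinExpertScheduler`, `submit`, `step`). It only models the dispatch order and ignores the cost of actually swapping expert weights in and out.

```python
from collections import deque


class RoundRobinExpertScheduler:
    """Keeps one queue per expert and serves experts in round-robin order."""

    def __init__(self, num_experts):
        self.queues = [deque() for _ in range(num_experts)]
        self.next_expert = 0

    def submit(self, request_id, expert_id):
        """Enqueue a request that needs the given expert."""
        self.queues[expert_id].append(request_id)

    def step(self):
        """Pick the next expert with pending work and drain its queue.

        Returns (expert_id, batch) or None if every queue is empty.
        """
        n = len(self.queues)
        for offset in range(n):
            expert_id = (self.next_expert + offset) % n
            queue = self.queues[expert_id]
            if queue:
                batch = list(queue)  # opportunistic batch: whatever is waiting
                queue.clear()
                self.next_expert = (expert_id + 1) % n
                return expert_id, batch
        return None


if __name__ == "__main__":
    sched = RoundRobinExpertScheduler(num_experts=3)
    sched.submit("a", 2)
    sched.submit("b", 0)
    sched.submit("c", 2)
    print(sched.step())
    print(sched.step())
```

Requests "a" and "c" end up in the same batch only because they happened to be waiting on the same expert when its turn came around, which is where the extra latency comes from.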
This is very much like Erlang's actor model. The same compute can be run in parallel, or managed via queues. With Erlang's strong support for FFI and process control, I wonder if it's being used as a dispatcher for these sorts of workloads.
> As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.

Inference works by computing a layer and then sending a very small vector to the next layer as input. When a model does not fit in a single GPU, you just divide it into layers and send the vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time before the next layer can be computed. The fastest inference on the planet at Cerebras uses this technique to do 2,500 tokens/sec on Llama 4 Maverick.
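
For a sense of scale, here is a back-of-the-envelope sketch of that transfer. The hidden size (8192), the 2-byte activations, and the 50 GB/s link are all assumptions I picked for illustration, and the calculation ignores fixed per-message link latency, which often matters more than bandwidth for messages this small.

```python
def activation_bytes(hidden_size, bytes_per_value=2, tokens=1):
    """Size of the activation vector handed from one layer to the next."""
    return hidden_size * bytes_per_value * tokens


def transfer_seconds(num_bytes, link_bytes_per_second):
    """Bandwidth-only transfer time (no fixed latency term)."""
    return num_bytes / link_bytes_per_second


if __name__ == "__main__":
    HIDDEN_SIZE = 8192      # hypothetical model width
    LINK_BANDWIDTH = 50e9   # hypothetical 400 Gb/s link, in bytes per second
    size = activation_bytes(HIDDEN_SIZE)
    micros = transfer_seconds(size, LINK_BANDWIDTH) * 1e6
    print(f"{size} bytes, {micros:.2f} microseconds per token")
```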

Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.

https://x.com/swyx/status/1760065636410274162?s=46

There is nothing big about Groq’s chips. Their individual chips have only 230 MB RAM. Unlike Cerebras, which can load multiple layers into a single chip, Groq must divide a layer across many chips.
Distributing inference per layer, instead of splitting each layer across GPUs, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 GPU works at a time), so inference is slower. In addition, orchestrating the start of the next batch on GPU #0 while GPU #1 is still working on the previous one is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.
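
The "only 1 GPU at a time" point can be put into numbers with the usual idealized pipeline-bubble estimate. This sketch assumes every stage takes the same time and that communication is free, so treat it as an upper bound rather than a measurement.

```python
def pipeline_utilization(stages, micro_batches):
    """Fraction of time each GPU is busy in an idealized pipeline.

    With P stages and M micro-batches, a full pass takes M + P - 1 time
    slots, and each stage is busy for M of them.
    """
    return micro_batches / (micro_batches + stages - 1)


if __name__ == "__main__":
    for m in (1, 4, 13):
        print(f"4 stages, {m} micro-batches: {pipeline_utilization(4, m):.1%}")
```

With a single batch in flight on 4 GPUs, each GPU is busy a quarter of the time. Keeping more micro-batches in flight raises that, which is exactly the orchestration problem mentioned above.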
In what software? llama.cpp and others divide things by layers.
could such a network with all its nodes and weights be deployed to an analog circuit and be superfast?
Do you mean something like this? https://www.etched.com/
Please go into more detail about this proposal, this piqued my interest in a really strange way.
The idea is to replicate the weights of the network in the electronics, somewhat like how our brains work. This way an analog input signal could lead to a neural-network-processed output signal without the digital emulation on a GPU. This is very much simplified, so the question is whether it could work for modern LLMs.
Suddenly "temperature" parameter starts making sense

(If you ever tried fine-tuning an analog circuit, you'll know how finicky the process is due to the environment, including temperature)

haha very true!
And this is the investment case for AMD, models fit entirely in a single chassis, and side benefit: less tariffed network equipment to interconnect compute. Map/reduce instead of clustered compute.

Edit: when downvoting, please offer some insight why you disagree

How is that a unique advantage for AMD?
AMD is consistently stacking more HBM.

  H100 80GB HBM3
  H200 141GB HBM3e
  B200 192GB HBM3e

  MI300x 192GB HBM3
  MI325x 256GB HBM3e
  MI355x 288GB HBM3e
This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.
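
A rough sketch of the fit calculation, using HBM capacities from the list above. The 400B-parameter model, the 1 byte per parameter, and the 20% reserve for KV cache and activations are hypothetical values chosen for illustration.

```python
import math


def gpus_needed(num_params, bytes_per_param, hbm_gb, reserve_fraction=0.0):
    """Minimum GPU count to hold the weights.

    reserve_fraction is the share of each GPU's HBM kept free for
    KV cache, activations, and so on.
    """
    weight_bytes = num_params * bytes_per_param
    usable_bytes = hbm_gb * 1e9 * (1.0 - reserve_fraction)
    return math.ceil(weight_bytes / usable_bytes)


if __name__ == "__main__":
    # Hypothetical 400B-parameter model stored at 1 byte per parameter.
    for name, hbm_gb in (("H100", 80), ("MI300x", 192), ("MI355x", 288)):
        count = gpus_needed(400e9, 1, hbm_gb, reserve_fraction=0.2)
        print(f"{name}: {count} GPUs")
```

Under these assumptions the same model needs 7 of the 80 GB cards but only 3 of the 192 GB cards, so it fits comfortably inside a single 8-GPU node either way, with much more headroom on the larger cards.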
It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.

"driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.

Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.

In the past, it was a huge issue, AMD would routinely ignore developers and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.

What was true a year ago isn't true today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.

I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that they are not. I have no reason to believe it is fixed given the history.

Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.

That was last year. MI300x firmware and software have gotten much better since then.
Unfortunately, AMD and ATI before it have had driver quality issues for decades; and both they and their fans have claimed that they have solved the problems every year since.

Even if they have made progress, I doubt that they have reached parity with Nvidia. I have had enough false hope from them that I am convinced that the only way that they will ever improve their drivers is if they let another group write the drivers for them.

Coincidentally, Valve has been developing the Vulkan driver used by SteamOS and other Linux distributions, which is how SteamOS is so much better than Windows. If AMD could get someone else to work on improving their GPGPU support, we would likely see it become quite good too. Until then, I have very low expectations.

So the MI300x has 8 different memory domains, and although you can treat it as one flat memory space, if you want to reach their advertised peak memory bandwidth you have to work with it like an 8-socket board.
MI355X isn't out yet, and the upcoming B300 also has 288GB HBM3e
June 12th.

B300 is Q4 2025.

Yes, they keep leapfrogging each other. AMD is still ahead in vram.

> when downvoting, please offer some insight why you disagree

And a reminder that (down)voting is not for (dis)agreement.