by brucethemoose2 1061 days ago

> LLMs are GPU compute-bound.

From what I understand, they are severely memory-bandwidth bound at a batch size of 1 on a GPU. Even llama.cpp is largely bound by RAM speed on a CPU, which has much less compute than a GPU.
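To put rough numbers on that, here is a back-of-envelope sketch in Python. Every figure in it is an assumption picked for illustration (a 7B-parameter model in fp16, 1 TB/s of memory bandwidth, 100 TFLOP/s of peak compute), not a measurement of any real GPU, and it ignores the KV cache and activations. The function names are made up for this example.

```python
# Back-of-envelope roofline estimate for one decode step.
# All model and hardware numbers below are illustrative assumptions.

def decode_step_times(n_params, bytes_per_param, batch_size,
                      mem_bw_bytes_per_s, peak_flops):
    """Return (memory_time_s, compute_time_s) for one token per sequence."""
    # The weights are streamed from memory once per step,
    # no matter how many sequences share that step.
    memory_time = (n_params * bytes_per_param) / mem_bw_bytes_per_s
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    compute_time = (2 * n_params * batch_size) / peak_flops
    return memory_time, compute_time

def bottleneck(n_params, bytes_per_param, batch_size,
               mem_bw_bytes_per_s, peak_flops):
    """Name whichever of the two times dominates the step."""
    mem_t, comp_t = decode_step_times(n_params, bytes_per_param, batch_size,
                                      mem_bw_bytes_per_s, peak_flops)
    return "bandwidth" if mem_t > comp_t else "compute"

N_PARAMS = 7e9          # assumed model size
BYTES_PER_PARAM = 2     # fp16 weights
MEM_BW = 1e12           # assumed 1 TB/s
PEAK_FLOPS = 100e12     # assumed 100 TFLOP/s

if __name__ == "__main__":
    for batch in (1, 8, 64, 256):
        mem_t, comp_t = decode_step_times(N_PARAMS, BYTES_PER_PARAM, batch,
                                          MEM_BW, PEAK_FLOPS)
        kind = bottleneck(N_PARAMS, BYTES_PER_PARAM, batch, MEM_BW, PEAK_FLOPS)
        print(f"batch={batch:4d} memory={mem_t * 1e3:.2f} ms "
              f"compute={comp_t * 1e3:.2f} ms -> {kind} bound")
```

With these assumed numbers, reading the weights takes about 100x longer than the math at batch size 1, so the compute units mostly sit idle until the batch gets large.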

It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...
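
As a toy illustration of why naive batching wastes so much, here is a small Python simulation. It is only a sketch of the general idea of continuous batching (refilling freed slots after every decode step), not the implementation from the linked post. It assumes each request needs a known, positive number of decode steps and ignores prefill entirely. Both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """After every step, finished requests leave and queued ones take their slots."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step produces one token per active request;
        # requests that reach zero remaining tokens drop out.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

if __name__ == "__main__":
    # Hypothetical workload: a few long generations mixed with short ones.
    lengths = [100, 5, 5, 5, 100, 5, 5, 5]
    print("static:    ", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

On this made-up workload the static version needs 200 decode steps and the continuous one needs 105, because short requests no longer hold a slot hostage until the longest one in their batch finishes.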