Hacker News new | ask | show | jobs
by brucethemoose2 1014 days ago
> LLMs are GPU compute-bound.

From what I understand, they are severely bandwidth bound at a GPU batch size of 1. Even llama.cpp is fairly RAM speed bound on a CPU with much less compute than a GPU.

It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...