by asne11 743 days ago
A "slot" is a processing unit, either GPU or CPU. I believe `llama.cpp` is CPU-only, so I'm guessing 1 slot = 1 core (or thread)?
3 comments

It divides the context into smaller "slots", so it can process requests concurrently with continuous batching. See also: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
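
To make the slot idea concrete, here is a toy sketch in Python. It is not llama.cpp's actual scheduler: the function names, the even split of the context, and the one-token-per-step model are all assumptions chosen for illustration. It shows a total context being divided among slots, and a queued request taking over a slot as soon as the previous request in it finishes.

```python
from collections import deque


def per_slot_context(total_ctx: int, n_slots: int) -> int:
    """Context tokens available to each slot, assuming an even split."""
    if n_slots < 1:
        raise ValueError("need at least one slot")
    return total_ctx // n_slots


def run_continuous_batching(requests, n_slots):
    """Simulate slots that each generate one token per step.

    requests: list of (name, tokens_to_generate) pairs (hypothetical workload).
    Returns a list of (name, step_at_which_it_finished).
    """
    queue = deque(requests)
    slots = [None] * n_slots  # each entry is [name, tokens_remaining] or None
    finished = []
    step = 0
    while queue or any(slot is not None for slot in slots):
        # Hand every idle slot the next waiting request, if there is one.
        for i in range(n_slots):
            if slots[i] is None and queue:
                name, tokens = queue.popleft()
                slots[i] = [name, tokens]
        # One decoding step: every busy slot produces one token in the same batch.
        step += 1
        for i, slot in enumerate(slots):
            if slot is None:
                continue
            slot[1] -= 1
            if slot[1] <= 0:
                finished.append((slot[0], step))
                slots[i] = None  # the slot is free again on the next step
    return finished


if __name__ == "__main__":
    print(per_slot_context(4096, 4))
    print(run_continuous_batching([("a", 3), ("b", 1), ("c", 2)], n_slots=2))
```

In this model request "c" does not wait for "a" to finish: it starts as soon as "b" frees its slot, which is the point of continuous batching.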
llama.cpp can run on CPU, on GPU, or in mixed mode (some layers run on the CPU and the rest on the GPU when you don't have enough VRAM for the whole model).
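
As a rough illustration of mixed mode, the sketch below estimates how many layers could go on the GPU for a given VRAM budget. The function name `plan_offload` and the assumption that every layer has the same size are hypothetical. A real setup also needs VRAM for the KV cache and scratch buffers, so treat the result as an upper bound rather than a recommended setting.

```python
def plan_offload(n_layers: int, layer_bytes: int, vram_bytes: int) -> tuple[int, int]:
    """Split layers between GPU and CPU under a simple equal-size-layer model.

    Returns (gpu_layers, cpu_layers). Ignores KV cache and other buffers.
    """
    if layer_bytes <= 0:
        raise ValueError("layer_bytes must be positive")
    gpu_layers = min(n_layers, vram_bytes // layer_bytes)
    return gpu_layers, n_layers - gpu_layers


if __name__ == "__main__":
    # Hypothetical numbers: 32 layers of ~200 MiB each, 4 GiB of usable VRAM.
    mib = 1024 * 1024
    print(plan_offload(32, 200 * mib, 4096 * mib))
```

The GPU count from a calculation like this is the kind of number you would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option, after leaving some headroom.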
llama.cpp is not CPU only…