Hacker News
by froh 6 hours ago
> GPUs are extremely underutilized if you launch just 1 generation stream

Why is that? Because the GPU is idling while it waits for the hooman? Or are there steps that could be parallelized or interleaved?

I have no intuition yet how this works under the hood.

1 comment

Some inference engines can process multiple requests in parallel more efficiently than running them one after another. I'm not sure of the exact mechanism, but llama.cpp's llama-server, for example, can do this: you tell it how many slots to have at startup, then fire HTTP requests at it, and it batches them together when it can.
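For anyone who wants to try this, here is a minimal client-side sketch of "fire HTTP requests at it". It assumes a llama-server already running on localhost port 8080 with several slots, and that it accepts POSTs to a `/completion` endpoint with `prompt` and `n_predict` fields and replies with a `content` field. That matches my understanding of llama-server, but check the server README for your build. The helper names (`build_request`, `complete`, `fan_out`) are made up for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: llama-server is listening here. Adjust to your setup.
BASE_URL = "http://localhost:8080"


def build_request(prompt, n_predict=64, base_url=BASE_URL):
    """Build (but do not send) a POST request for one completion."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/completion",  # assumed endpoint name
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send one prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["content"]  # assumed response field


def fan_out(prompts, worker=complete, max_in_flight=4):
    """Keep up to max_in_flight requests open at once.

    The client only sends requests concurrently; any batching happens
    inside the server. Results come back in the same order as prompts.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(worker, prompts))
```

With a server started with four slots (I believe the flag is `--parallel 4`, but check `llama-server --help`), `fan_out(["prompt one", "prompt two", "prompt three", "prompt four"])` keeps all four requests in flight together. Timing that against four sequential `complete()` calls should show whether batching helps on your hardware.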

Waiting for the hooman (or for tool calls) doesn't help utilization either, of course.
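On the "why" from the question: the usual explanation is that generating one token at a time is limited by memory bandwidth, not compute. Each decoding step has to read all the model weights from GPU memory, and with a single stream that whole read produces just one token. With a batch, the same read serves every stream. Below is a back-of-envelope model of that. All the numbers are hypothetical round figures (a 7B-parameter model in 16-bit weights, 1 TB/s of memory bandwidth, 100 TFLOP/s of compute), and it ignores KV-cache traffic and other overheads, so treat it as intuition, not a benchmark.

```python
def step_time_s(batch, params=7e9, bytes_per_param=2,
                bandwidth=1e12, peak_flops=1e14):
    """Rough time for one decoding step that emits one token per stream.

    All defaults are hypothetical round numbers, not measurements.
    """
    # Every step reads the full set of weights once, whatever the batch size.
    weight_read = params * bytes_per_param / bandwidth
    # Roughly 2 FLOPs per parameter per generated token.
    compute = batch * 2 * params / peak_flops
    # Simplification: the slower of the two sets the step time.
    return max(weight_read, compute)


def tokens_per_second(batch, **hw):
    """Total tokens per second across all streams in the batch."""
    return batch / step_time_s(batch, **hw)
```

With these made-up numbers the weight read takes about 14 ms per step while the math for one token takes about 0.14 ms, so a single stream leaves the compute units idle roughly 99% of the time. In this model throughput grows linearly with batch size until compute becomes the limit (at a batch of 100 here). Real engines hit other limits sooner.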