Hacker News
by mikeayles 17 days ago
So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a dumb 3.28M model, and even this is an order of magnitude too large to benefit.

It appears to be focussed more on latency than throughput. Happy to be corrected?

3 comments

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?
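
For a rough sense of scale, here is a back-of-the-envelope sketch of that budget for a single sequential stream. The 5 µs launch overhead is an illustrative assumption (the comment only says "on the order of microseconds"), and the function names are made up for this example.

```python
import math


def per_token_budget_us(tokens_per_s: float) -> float:
    """Time available per token, in microseconds, when tokens are produced one after another."""
    return 1e6 / tokens_per_s


def launches_that_fit(tokens_per_s: float, launch_overhead_us: float) -> int:
    """Kernel launches that fit in the per-token budget if launch overhead were the only cost."""
    return math.floor(per_token_budget_us(tokens_per_s) / launch_overhead_us)


# 100k tok/s leaves 10 µs per token at batch size 1.
print(per_token_budget_us(100_000))  # 10.0

# Assumed (not measured) 5 µs per launch: only two launches fit per token.
print(launches_that_fit(100_000, launch_overhead_us=5.0))  # 2
```

Under these assumed numbers, a batch-size-1 decode loop that needs more than a couple of launches per token would spend its whole budget on launch overhead alone.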

You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.
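
To put the distinction in numbers: 100k tok/s only implies 10 µs per token if a single sequence is decoded one token at a time. A minimal sketch, assuming every sequence in a batch emits one token per decode step; the batch size of 256 is an arbitrary example and `step_budget_us` is a hypothetical helper.

```python
def step_budget_us(tokens_per_s: float, batch_size: int) -> float:
    """Time one decode step may take, in microseconds, while still
    sustaining the target aggregate throughput across the batch."""
    return batch_size * 1e6 / tokens_per_s


for batch_size in (1, 256):
    budget = step_budget_us(100_000, batch_size)
    print(f"batch={batch_size}: {budget} us per step")
```

With one sequence, each step has 10 µs; with 256 sequences in flight, each step has 2560 µs, so the same throughput target tolerates far higher per-step latency.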
When doing autoregressive inference, how often do you do a CUDA kernel call? What is the main bottleneck at the throughputs you're operating at?

You're correct that this work is not very applicable for LLMs and that the focus here is primarily on latency.

Was anyone thinking this?