HN Mirror


by mikeayles 17 days ago

So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a 3.28M-parameter dumb model, and even this is an order of magnitude too large to benefit.

It appears to be focussed more on latency than throughput. Happy to be corrected.

3 comments

ssivark 16 days ago

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?
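To put rough numbers on that concern, here is a back-of-the-envelope sketch for a single sequential stream. The launch count per token and the per-launch overhead are illustrative assumptions, not measurements of any real setup, and the function names are made up for this example.

```python
def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available per token, in microseconds, for one sequential stream."""
    return 1e6 / tokens_per_second


def launch_overhead_fraction(tokens_per_second: float,
                             launches_per_token: int,
                             launch_overhead_us: float) -> float:
    """Fraction of the per-token budget spent on kernel launch overhead alone."""
    budget_us = per_token_budget_us(tokens_per_second)
    return launches_per_token * launch_overhead_us / budget_us


# Assumed values: 4 kernel launches per token, 5 microseconds per launch.
budget = per_token_budget_us(100_000)
fraction = launch_overhead_fraction(100_000, launches_per_token=4,
                                    launch_overhead_us=5.0)
print(f"budget per token: {budget:.1f} us")
print(f"launch overhead uses {fraction:.0%} of that budget")
```

Under these assumed values the launch overhead alone exceeds the per-token budget, which is why a single unbatched stream at this rate would need far fewer launches per token or some way of amortising them.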

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?


taneq 16 days ago

You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10 µs.
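A minimal sketch of the distinction, assuming each decoding step emits one token per stream in the batch. The batch size and step latency are hypothetical numbers chosen only to make the arithmetic visible.

```python
def aggregate_throughput(batch_size: int, step_latency_us: float) -> float:
    """Tokens per second when every step yields one token per batched stream."""
    return batch_size * 1e6 / step_latency_us


# One stream with a 10 us step, versus 100 streams with a 1000 us step.
single_stream = aggregate_throughput(batch_size=1, step_latency_us=10.0)
batched = aggregate_throughput(batch_size=100, step_latency_us=1000.0)
print(single_stream, batched)
```

Both configurations reach the same aggregate rate, but the batched one tolerates a step that is 100 times slower. Only the single-stream case actually needs a token every 10 µs.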


ssivark 16 days ago

When doing autoregressive inference, how often do you do a CUDA kernel call? What is the main bottleneck at the throughputs you're operating at?


ag2718 17 days ago

You're correct that this work is not very applicable to LLMs and that the focus here is primarily on latency.


ai_fry_ur_brain 17 days ago

Was anyone thinking this?
