HN Mirror

	by storystarling 141 days ago
	The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.

2 comments

radarsat1 141 days ago

I don't know what the pros are doing but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this, it's just some logic.

storystarling 141 days ago

Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.
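
To make the "state machine" point concrete, here is a minimal host-side sketch in Python of grammar-constrained greedy decoding. Everything in it is an assumption for illustration: the toy vocabulary, the hand-written grammar table, and the `toy_logits` stand-in for a model are all hypothetical, and a real implementation would run the masking on the device over a full tokenizer vocabulary.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", "key", ":", "value", ","]

# Hypothetical grammar as a DFA: state -> {allowed token: next state}.
# It accepts strings shaped like { key : value (, key : value)* }.
GRAMMAR = {
    "start": {"{": "want_key"},
    "want_key": {"key": "want_colon"},
    "want_colon": {":": "want_value"},
    "want_value": {"value": "after_value"},
    "after_value": {",": "want_key", "}": "done"},
}


def mask_logits(logits, state):
    """Replace the logit of every token the grammar forbids in `state` with -inf."""
    allowed = GRAMMAR[state]
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_greedy_decode(logits_fn, max_steps=32):
    """Greedy decoding where each step is filtered through the grammar state."""
    state, out = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        masked = mask_logits(logits_fn(out), state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        token = VOCAB[best]
        out.append(token)
        state = GRAMMAR[state][token]  # advance the state machine
    return out, state


def toy_logits(prefix):
    """Stand-in for a model: fixed logits that always prefer '}' (hypothetical)."""
    return [0.1, 2.0, 0.5, 0.3, 0.4, 1.0]


tokens, final_state = constrained_greedy_decode(toy_logits)
print(" ".join(tokens), "->", final_state)
```

Even in this toy, the decode loop carries per-sequence state that has to be updated after every token. With batching, each sequence in the batch would need its own state and its own mask, which is the part that tends to get messy in hand-written kernels.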

radarsat1 140 days ago

I was thinking mainly about the standard AR loop. Yes, I can see that grammars would make it a bit more complicated, especially when considering batching.

tucnak 141 days ago

Turns out how? Where are the numbers?

storystarling 141 days ago

It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1 you see the GPU spending a lot of time idle waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.
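
The argument above can be written down as a first-order model. This is only a sketch under stated assumptions: launches are issued asynchronously, so a decode step costs whichever is larger of total kernel time and total host dispatch time, plus one host synchronization to read back the sampled token. The function name and every input value below are hypothetical placeholders, not measurements; real figures have to come from a profiler.

```python
def decode_step_us(exec_us_per_kernel, num_kernels, launch_us, num_launches, sync_us):
    """Return (step time in microseconds, fraction of the step the GPU sits idle).

    Assumed model: the GPU is busy for num_kernels * exec_us_per_kernel, the host
    needs num_launches * launch_us to dispatch the work, the two overlap, and each
    step ends with one host sync costing sync_us.
    """
    busy = exec_us_per_kernel * num_kernels
    dispatch = launch_us * num_launches
    step = max(busy, dispatch) + sync_us
    return step, 1 - busy / step


# Placeholder inputs, chosen only to exercise the formula.
per_kernel = decode_step_us(exec_us_per_kernel=2, num_kernels=100,
                            launch_us=5, num_launches=100, sync_us=10)
# Same work submitted as a single launch, as with a captured graph.
single_launch = decode_step_us(exec_us_per_kernel=2, num_kernels=100,
                               launch_us=5, num_launches=1, sync_us=10)

print("one launch per kernel: step=%d us, idle=%.1f%%" % (per_kernel[0], 100 * per_kernel[1]))
print("single launch:         step=%d us, idle=%.1f%%" % (single_launch[0], 100 * single_launch[1]))
```

The model only says when dispatch matters: if per-kernel execution time exceeds per-launch overhead, the `max` picks the compute term and the idle fraction comes almost entirely from the sync.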

tucnak 141 days ago

I'm not convinced. (A bit of advice: if you wish to make a statement about performance, always start by measuring things. Then when somebody asks you for proof/data, you would already have it.) If what you're saying were true, it would be a big deal, except unfortunately it isn't.

Dispatch has overhead, but it's largely insignificant. Where it would otherwise be significant:

1. Fused kernels exist

2. CUDA graphs (and other forms of work-submission pipelining) exist

saagarjha 141 days ago

CUDA graphs are pretty slow at synchronizing things.
