| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gaeld 18 days ago

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes we you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.

2 comments

dr_kiszonka 18 days ago

Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?

link

gaeld 18 days ago

I'm sure there are, and I really hope we can work on consumer-grade GPUs at some point.

It should be possible to apply the same methodology (digging deep into the hardware details to understand all its little characteristics, and rethinking the inference stack around that).

link

bcjdjsndon 18 days ago

That doesn't clarify anything lol. It's a bit click baity.

link