Hacker News
by qeternity 921 days ago
Yeah I call BS on this. This does nothing to address the main issues with autoregressive transformer models (memory bandwidth).

GPU compute units are mostly sitting idle these days waiting for on-chip cache to receive data from VRAM.

This does nothing to solve that.
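
To put a rough number on the memory bandwidth point, here is a back-of-the-envelope roofline sketch. It is not from the thread: the function names and the hardware figures (300 TFLOP/s peak, 2 TB/s VRAM bandwidth) are illustrative assumptions, and it ignores KV cache and activation traffic. It assumes about 2 FLOPs per parameter per generated token and that every weight is read from VRAM once per decode step.

```python
def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Assumes ~2 FLOPs (multiply + add) per parameter for each sequence in
    the batch, and that the weights are read from VRAM once per step.
    """
    return 2.0 * batch_size / bytes_per_param


def compute_utilization(batch_size: int, peak_flops: float,
                        mem_bandwidth: float,
                        bytes_per_param: float = 2.0) -> float:
    """Upper bound on the fraction of peak compute a decode step can use."""
    # Ridge point: FLOPs per byte needed before compute becomes the limit.
    ridge = peak_flops / mem_bandwidth
    return min(1.0, decode_intensity(batch_size, bytes_per_param) / ridge)


if __name__ == "__main__":
    # Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
    for batch in (1, 8, 64, 256):
        u = compute_utilization(batch, peak_flops=300e12, mem_bandwidth=2e12)
        print(f"batch={batch:4d}  max compute utilization ~ {u:.1%}")
```

Under these assumptions a single fp16 stream does about 1 FLOP per byte loaded, far below the roughly 150 FLOPs per byte the hypothetical chip would need to stay busy.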

2 comments

You can amortize memory loading with large continuous batching. I imagine more compute would help for certain workloads like speculative decoding.
Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.
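
A toy model of the batching tradeoff described above, for anyone who wants to play with the numbers. Everything here is an assumption for illustration: a hypothetical 7B-parameter fp16 model, 0.5 GB of KV cache read per sequence per step, and the same made-up 300 TFLOP/s and 2 TB/s hardware figures. Step time is taken as the larger of memory time and compute time, which is a simplification.

```python
def step_time(batch: int, params: float, weight_bytes: float,
              kv_bytes_per_seq: float, peak_flops: float,
              mem_bandwidth: float) -> float:
    """Seconds for one decode step (one new token per sequence)."""
    # Weights are read once per step; KV cache is read once per sequence.
    mem_time = (weight_bytes + batch * kv_bytes_per_seq) / mem_bandwidth
    # ~2 FLOPs per parameter per sequence.
    compute_time = 2.0 * params * batch / peak_flops
    return max(mem_time, compute_time)


if __name__ == "__main__":
    # All values below are hypothetical.
    cfg = dict(params=7e9, weight_bytes=14e9, kv_bytes_per_seq=0.5e9,
               peak_flops=300e12, mem_bandwidth=2e12)
    for batch in (1, 8, 64):
        t = step_time(batch, **cfg)
        print(f"batch={batch:3d}  per-token latency={t * 1e3:6.2f} ms  "
              f"throughput={batch / t:8.1f} tok/s")
```

In this model total throughput climbs with batch size because the weight reads are shared, while each stream waits longer per token because the step also has to read every other sequence's KV cache.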

Not exactly idle, but only at around 30% utilization on average (measured on a ~900 GPU cluster over ~25 days).
If it's at 30% utilization then it's "mostly idle".
I agree. I am surprised that many folks, not you of course, think that is okay.