
by qeternity 921 days ago

Yeah I call BS on this. This does nothing to address the main issues with autoregressive transformer models (memory bandwidth).

GPU compute units are mostly sitting idle these days, waiting for on-chip cache to receive data from VRAM.

This does nothing to solve that.

2 comments

dramlord 921 days ago

You can amortize memory loading with large continuous batching. I imagine more compute would help the problem for certain workloads like speculative decoding.
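
To make the amortization argument concrete, here is a rough roofline-style sketch. It assumes each decode step streams the weights from VRAM once regardless of batch size, costs about 2 FLOPs per parameter per token, and ignores KV-cache traffic. The function name and every hardware and model number are hypothetical placeholders, not measurements.

```python
# Roofline-style estimate of one decode step.
# All model and hardware numbers below are hypothetical placeholders.

def decode_step(batch_size, n_params, bytes_per_param, mem_bw, peak_flops):
    """Return (step_time_seconds, compute_utilization) for one decode step."""
    # Weights are streamed from VRAM once per step, independent of batch size.
    t_mem = n_params * bytes_per_param / mem_bw
    # Roughly 2 FLOPs per parameter for each token in the batch.
    t_compute = 2 * n_params * batch_size / peak_flops
    # The slower of the two bounds the step.
    step_time = max(t_mem, t_compute)
    return step_time, t_compute / step_time

N_PARAMS = 7e9         # assumed model size
BYTES_PER_PARAM = 2    # assumed 16-bit weights
MEM_BW = 1e12          # assumed 1 TB/s memory bandwidth
PEAK_FLOPS = 1e14      # assumed 100 TFLOP/s peak compute

for b in (1, 16, 64, 128):
    t, util = decode_step(b, N_PARAMS, BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS)
    print(f"batch={b:4d}  step={t * 1e3:6.2f} ms  "
          f"tokens/s={b / t:9.0f}  compute util={util:6.1%}")
```

Under these assumptions the step time stays flat while the step is memory-bound, so extra streams in the batch are nearly free, and compute utilization climbs until the compute bound takes over.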


qeternity 919 days ago

Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.
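
A rough sketch of the latency cost of batching, assuming a purely memory-bound step where the weights are read once and every stream's KV cache is read as well. The function names and all sizes are hypothetical placeholders, not measurements.

```python
# Memory-bound decode step with per-stream KV-cache traffic.
# All sizes and the bandwidth figure are hypothetical placeholders.

def step_latency(batch_size, weight_bytes, kv_bytes_per_stream, mem_bw):
    """Seconds per decode step: weights read once, plus each stream's KV cache."""
    return (weight_bytes + batch_size * kv_bytes_per_stream) / mem_bw

def throughput(batch_size, weight_bytes, kv_bytes_per_stream, mem_bw):
    """Aggregate tokens per second across all streams in the batch."""
    return batch_size / step_latency(
        batch_size, weight_bytes, kv_bytes_per_stream, mem_bw
    )

WEIGHT_BYTES = 14e9          # assumed weight footprint
KV_BYTES_PER_STREAM = 0.5e9  # assumed KV-cache bytes read per stream per step
MEM_BW = 1e12                # assumed 1 TB/s memory bandwidth

for b in (1, 8, 32):
    lat = step_latency(b, WEIGHT_BYTES, KV_BYTES_PER_STREAM, MEM_BW)
    tput = throughput(b, WEIGHT_BYTES, KV_BYTES_PER_STREAM, MEM_BW)
    print(f"batch={b:3d}  per-token latency={lat * 1e3:6.2f} ms  tokens/s={tput:8.0f}")
```

In this model aggregate throughput rises with batch size, but every individual stream waits longer for each token.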


pavelstoev 921 days ago

Not exactly idle, but only at around 30% utilization on average (measured on a ~900 GPU cluster over ~25 days).


sp332 921 days ago

If it's at 30% utilization then it's "mostly idle".


pavelstoev 920 days ago

I agree. I am surprised that many folks, not you of course, think that is okay.
