by foobar10000 39 days ago
Minor nit re [2]: for agentic workloads that are actually worth money, i.e., claude code and similar, things are either prefill-bound (which this does not help) or, more importantly, tps/user-bound at 150k+ context windows: you want your big magic model to emit 200 tps/user. This is why Nvidia bought Groq (now LPU) and what Cerebras is trying to do, etc. So for the stuff that makes money in the field, GPUs are not really compute-bound once context lengths are large, but still memory-transfer-bound (maybe KV-cache transfer, maybe HBM -> on-chip SRAM, etc.).
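
A back-of-envelope sketch of the memory-transfer-bound point, assuming a single user stream where every decode step has to read the active weights plus the whole KV cache from memory. The function name and every number below are hypothetical placeholders, not measurements of any real model or chip:

```python
def decode_tps_upper_bound(weight_bytes, kv_bytes_per_token,
                           context_tokens, mem_bandwidth_bytes_per_s):
    """Rough ceiling on tokens/sec for one user when decode is bandwidth-bound.

    Assumes (simplification) that each generated token requires streaming
    the active weights and the full KV cache once, with no batching.
    """
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_tokens
    return mem_bandwidth_bytes_per_s / bytes_per_step


# Hypothetical inputs: 100 GB of active weights, 100 KB of KV cache per
# token, 3 TB/s of memory bandwidth.
short_ctx = decode_tps_upper_bound(100e9, 100e3, 1_000, 3e12)
long_ctx = decode_tps_upper_bound(100e9, 100e3, 150_000, 3e12)
print(f"1k context:   {short_ctx:.1f} tps/user ceiling")
print(f"150k context: {long_ctx:.1f} tps/user ceiling")
```

Under these made-up numbers the ceiling falls as context grows because the KV-cache term grows with it, and adding FLOPs does nothing to raise it. Batching amortizes the weight reads across users but not the per-user KV reads.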
1 comment

> i.e., claude code and similar, things are either prefill-bound

Prefix caching greatly accelerates the prefill on each turn. Once you account for it, and barring large file reads, prefill isn't the bottleneck compared with decoding reasoning tokens. The same goes for script-writing, which is also decode-heavy.

This is especially true during exploration phases, when the agent is traversing directory trees and grepping files: you're talking about a few hundred new tokens/turn.
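
To put rough numbers on that, here is a minimal sketch assuming prefix caching means only the new tokens in a turn need prefill. The helper name and all throughput figures are hypothetical, picked only to show the shape of the tradeoff:

```python
def turn_time(context_tokens, new_tokens, output_tokens,
              prefill_tps, decode_tps, prefix_cached=True):
    """Return (prefill_seconds, decode_seconds) for one agent turn.

    Assumes (simplification) constant prefill and decode throughput and
    that a prefix-cache hit skips prefill for all prior context.
    """
    prefill_tokens = new_tokens if prefix_cached else context_tokens + new_tokens
    return prefill_tokens / prefill_tps, output_tokens / decode_tps


# Hypothetical exploration turn: 150k tokens of cached context, 300 new
# tokens of grep output, 500 reasoning/output tokens to decode.
cached = turn_time(150_000, 300, 500, prefill_tps=5_000, decode_tps=50)
uncached = turn_time(150_000, 300, 500, prefill_tps=5_000, decode_tps=50,
                     prefix_cached=False)
print("cached   (prefill s, decode s):", cached)
print("uncached (prefill s, decode s):", uncached)
```

With a cache hit the decode time dominates the turn; without one, re-prefilling the whole context dominates instead, which is the case the parent comment describes.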