| HN Mirror


by pqtyw 16 days ago

How is tok/s not a bottleneck? I assume most people still use AI agents interactively rather than leaving them to do their own thing during the night.

I find anything below 50 tps or so entirely unusable...

Regardless, it's apples to oranges anyway. Inference is quite cheap for open-weight models; it's just that Claude and OpenAI can charge very high margins compared to e.g. DeepSeek or the various providers on OpenRouter, since open models are a commodity.

3 comments

brianwawok 16 days ago

I start up 4 or so projects, then go do other things for 4 hours. I don’t have enough energy to steer overnight, but I’m at least “semi afk” for daytime steering. So throughput is king for me: tokens per hour, not latency or actual tokens per second.


smallerize 16 days ago

Running locally is even worse for this, because if you're running 4 jobs at once they just run at 1/4 speed. Not literally, you can make up some of the difference with batching, but you have limited resources instead of spreading your requests out on an API provider's nodes.
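To put rough numbers on that, here is a minimal sketch of the arithmetic. The 40 tok/s single-stream rate and the `batching_gain` multiplier are made-up illustrative assumptions, not measurements from any particular machine or model.

```python
def per_job_tps(single_stream_tps, jobs, batching_gain=1.0):
    """Tokens/s each job sees when `jobs` share one local machine.

    batching_gain is a hypothetical multiplier on aggregate throughput
    relative to a single stream: 1.0 means batching recovers nothing,
    and a value equal to `jobs` would mean no slowdown at all.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # Aggregate throughput cannot exceed `jobs` full-speed streams.
    aggregate_tps = single_stream_tps * min(batching_gain, jobs)
    return aggregate_tps / jobs


def tokens_per_hour(tps):
    """Convert a tokens-per-second rate into tokens per hour."""
    return tps * 3600


# Assumed numbers, for illustration only.
no_batching = per_job_tps(40, jobs=4)                       # the "1/4 speed" case
some_batching = per_job_tps(40, jobs=4, batching_gain=2.0)  # partial recovery
print(no_batching, some_batching, tokens_per_hour(40))
```

Under these assumptions, each of the 4 jobs runs at 10 tok/s with no batching gain and 20 tok/s if batching doubles aggregate throughput. Total tokens per hour only rises if `batching_gain` is above 1.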


sweetjuly 16 days ago

Is interactive use for coding something that actually works today? With unsafe mode, even frontier hosted models are slow enough I end up just tabbing out to work on other tasks. It would need to be much faster if I am to sit and stare at it while it churns. Local models might be a lot slower but workflow-wise it doesn't change much for me.


cyanydeez 16 days ago

It's not a bottleneck if you care about the actual code.


pqtyw 16 days ago

I would expect the overwhelming majority of output tokens would not be the actual code but used for analysis, reasoning, testing and iteration. If you only use the agent for autocomplete then yes, the calculation is probably different.


cyanydeez 16 days ago

Yeah, and understanding that too is important. The idea that you don't need to read code or analysis seems to align with the dependency addiction being shoved in the pipe.
