Hacker News
by Bolwin 18 days ago
No one is producing one output token though.

And using up GPUs for that cache is a pretty big opportunity cost. I highly doubt it's held in VRAM; that would be insane for the one-hour caches.
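For a sense of scale, here is a rough sketch of how big such a cache could be, assuming it is the attention key/value (KV) cache for the prompt. The model shape used below (80 layers, 8 KV heads, head dimension 128, fp16 values) is a hypothetical example, not any provider's actual model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold the attention key/value cache for num_tokens.

    The factor of 2 covers the separate key and value tensors in every layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


if __name__ == "__main__":
    # Hypothetical model shape: 80 layers, 8 KV heads (grouped-query attention),
    # head_dim 128, fp16 (2 bytes per value). Not any specific vendor's model.
    size = kv_cache_bytes(80, 8, 128, num_tokens=100_000)
    print(f"{size / 1e9:.1f} GB for a 100k-token prompt")  # 32.8 GB
```

Tens of GB per cached prompt is why parking it in VRAM for an hour looks so expensive.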

So it's the memory, plus the time it takes to load the cache into (and unload it from) VRAM, plus the extra cost per output token.
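A back-of-the-envelope sketch of that sum. Every price and bandwidth figure is a made-up placeholder, and treating "unload/load" as one copy into VRAM plus one copy back out is my own assumption. The point is only that the fixed costs (storage, copying) get spread over however many output tokens the request produces.

```python
from dataclasses import dataclass


@dataclass
class CacheCostAssumptions:
    # Hypothetical placeholder figures, not any provider's real prices.
    storage_usd_per_gb_hour: float     # keeping the cache in host RAM/SSD
    gpu_usd_per_second: float          # GPU time tied up while copying
    transfer_gb_per_second: float      # host <-> VRAM bandwidth
    extra_usd_per_output_token: float  # added decode cost per generated token


def cache_hit_cost(a: CacheCostAssumptions, cache_gb: float,
                   hours_held: float, output_tokens: int) -> dict:
    """Memory + load/unload time + per-output-token overhead for one cache hit.

    output_tokens must be >= 1 (it is used to amortize the fixed costs).
    """
    storage = a.storage_usd_per_gb_hour * cache_gb * hours_held
    # Assumed: one copy into VRAM and one copy back out.
    copy_seconds = 2 * cache_gb / a.transfer_gb_per_second
    transfer = a.gpu_usd_per_second * copy_seconds
    output = a.extra_usd_per_output_token * output_tokens
    total = storage + transfer + output
    return {
        "storage": storage,
        "transfer": transfer,
        "output": output,
        "total": total,
        # Fixed costs are amortized over every token the request generates.
        "per_output_token": total / output_tokens,
    }


if __name__ == "__main__":
    a = CacheCostAssumptions(0.01, 0.001, 25.0, 1e-6)
    for n in (1, 1_000, 10_000):
        cost = cache_hit_cost(a, cache_gb=32.8, hours_held=1.0, output_tokens=n)
        print(f"{n:>6} output tokens: ${cost['total']:.4f} total, "
              f"${cost['per_output_token']:.8f} per token")
```

With any realistic output length, the per-token share of the fixed costs drops quickly, which is the "no one is producing one output token" point.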

Is it a scam? Idk