| HN Mirror


	by uoaei 59 days ago
	Exactly. Even in the throes of today's wacky economic tides, storage is still cheap. Write the cached model state to disk immediately after the N context messages, then reload it without rerunning inference on the context tokens themselves. If every customer did this for ~3 conversations per user, you would still need only a small fraction of a typical datacenter to house the necessary drives. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, rather than inference time on GPUs.
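
	For a rough sense of the numbers behind this claim, here is a back-of-envelope sketch. It assumes a standard decoder-only transformer whose per-sequence state is the key/value cache. The model shape, context length, user count, and drive bandwidth below are all hypothetical placeholders, not figures from any provider.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The factor of 2 covers one key tensor and one value tensor per layer.
    bytes_per_value=2 assumes 16-bit storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def reload_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to stream a cache of the given size at a sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical model shape and workload (assumptions, not real figures).
per_conversation = kv_cache_bytes(
    num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=100_000
)
total = per_conversation * 3 * 1_000_000  # 3 conversations for each of 1M users

print(f"{per_conversation / 1e9:.1f} GB per conversation")
print(f"{total / 1e15:.1f} PB for 1M users x 3 conversations")
# Hypothetical 7 GB/s sustained read from a single drive.
print(f"{reload_seconds(per_conversation, 7e9):.1f} s to reload one conversation")
```

	Under these made-up inputs the storage total is large but finite, and the reload time per conversation is set by read bandwidth rather than by GPU compute, which is the trade-off the comment points at. Different model shapes or context lengths change the result proportionally.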

jeremyjh 59 days ago

This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200

uoaei 57 days ago

They don't cache model state to disk. I am proposing they do.

jeremyjh 57 days ago

I’m proposing that you should educate yourself on the subject of LLM KV context caching.
