| HN Mirror



	by cold_harbor 69 days ago
	Their MLA (multi-head latent attention) architecture cuts the KV cache by roughly 5-13x versus standard attention. That's why inference is genuinely cheaper to run, and the low price is not just a price war to gain market share.
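
For a rough sense of where a reduction like that comes from, here is a back-of-the-envelope sketch. It assumes the baseline caches a full key and value vector per KV head, while the latent variant caches one compressed vector plus a small positional (RoPE) key per token. Every dimension below is hypothetical, picked to make the arithmetic clean, and is not the shape of any real model; the actual ratio depends heavily on which baseline you compare against.

```python
def kv_elems_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer when full keys and values are stored."""
    return 2 * n_kv_heads * head_dim  # one key and one value vector per KV head


def latent_elems_per_token(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer when a compressed latent
    plus a small positional (RoPE) key is stored instead."""
    return latent_dim + rope_dim


def cache_bytes(elems_per_token: int, n_layers: int, n_tokens: int,
                bytes_per_elem: int = 2) -> int:
    """Total cache size; bytes_per_elem=2 assumes fp16/bf16 storage."""
    return elems_per_token * n_layers * n_tokens * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
LATENT_DIM, ROPE_DIM = 192, 64
CONTEXT = 32_768  # tokens held in the cache

baseline = cache_bytes(kv_elems_per_token(N_KV_HEADS, HEAD_DIM), N_LAYERS, CONTEXT)
latent = cache_bytes(latent_elems_per_token(LATENT_DIM, ROPE_DIM), N_LAYERS, CONTEXT)
ratio = baseline / latent

print(f"baseline: {baseline / 2**30:.2f} GiB, "
      f"latent: {latent / 2**30:.2f} GiB, ratio: {ratio:.1f}x")
# prints "baseline: 4.00 GiB, latent: 0.50 GiB, ratio: 8.0x"
```

With these made-up numbers the cache shrinks 8x, which is the kind of saving that lets more concurrent requests or longer contexts fit in the same memory.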

2 comments

zozbot234 69 days ago

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

vitorsr 69 days ago

Yes. The discount was most likely a "post-market trial" of how efficiently the caching works for the new generation of models.

trollbridge 69 days ago

I've "adjusted" my workflows now to use the cache. (Basically read all the files in your project very early on in your session, etc., simple stuff like that.)

Nearly all requests are cached now. It's amazing.
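
A toy sketch of why reading the files early helps, assuming the cache only matches on an exact shared prefix between requests. Real providers have their own granularity and expiry rules, so this is not any particular API, and `shared_prefix_len` is a hypothetical helper written just for this illustration.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count leading items that are identical in both requests.
    Under the exact-prefix assumption, only this part can be served from cache."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


# Stand-ins for the tokenized contents of the project files.
project_files = [f"file_{i}" for i in range(8)]

# Stable content first, changing question last: the prefix is reusable.
req1_good = project_files + ["question_1"]
req2_good = project_files + ["question_2"]

# Changing question first: the two requests diverge at position 0.
req1_bad = ["question_1"] + project_files
req2_bad = ["question_2"] + project_files

good_hit = shared_prefix_len(req1_good, req2_good)
bad_hit = shared_prefix_len(req1_bad, req2_bad)

print(good_hit, bad_hit)  # prints "8 0"
```

Same content in both orderings, but only the files-first layout leaves anything for the cache to reuse.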
