HN Mirror

	by littlestymaar 39 days ago
	So, it's D-Flash, but applied at each transformer layer and sharing the KV cache of the original model? Very smart!

1 comment

foobar10000 39 days ago

Kind of, yeah. Predictivity is an open question for larger layers when trying to scale this up, though. But yeah, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
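
For a rough sense of where a number like 7x can come from, here is a back-of-the-envelope sketch. It is not the commenter's own calculation; it borrows the usual geometric-series estimate from speculative decoding and assumes each predicted token is accepted independently with a fixed probability. The names `accept_rate`, `draft_len`, and `draft_cost` are made up for illustration. Under those assumptions, a 95% acceptance rate with 8 predicted tokens per pass gives roughly 7.4 tokens per full-model pass if prediction were free, and less once the predictor's own cost is counted.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the full model.

    Assumption: each predicted token is accepted independently with
    probability accept_rate, and the full model always contributes one
    token of its own (the correction or the bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every prediction is accepted, plus the full model's own token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough speedup over plain one-token-per-pass decoding.

    Assumption: draft_cost is the cost of one prediction step relative to
    one full-model pass (0.0 means prediction is free).
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    cost_per_step = draft_len * draft_cost + 1.0
    return tokens / cost_per_step


if __name__ == "__main__":
    for draft_len in (4, 8, 16):
        free = estimated_speedup(0.95, draft_len, draft_cost=0.0)
        costly = estimated_speedup(0.95, draft_len, draft_cost=0.05)
        print(f"draft_len={draft_len}: free={free:.2f}x, 5% cost={costly:.2f}x")
```

The point of the second function is that the headline figure only holds "if done right": the per-step cost of the predictor eats into the gain, and so does any drop in acceptance rate at larger scale.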
