by littlestymaar 39 days ago
So, it's D-Flash but applied at each transformer layer, sharing the KV cache of the original model? Very smart!

Kind of, yeah. Predictivity is still an open question for larger layers when trying to scale this up, though. But yes, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
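
For a rough sense of how a 95% figure and a 7x figure could relate, here is a back-of-the-envelope sketch. It is an assumption on my part, not necessarily how the number above was derived: it borrows the standard speculative-decoding estimate, where each predicted step is accepted independently with probability `accept_rate`, up to `draft_len` steps are predicted per full verification pass, and the predictor itself is treated as free. The function name and parameters are made up for illustration.

```python
def expected_steps_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected number of steps produced per full-model pass.

    Assumes independent acceptance with probability `accept_rate` and
    ignores the cost of running the predictor (both are simplifications).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every predicted step is accepted, plus the one the full pass yields.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, round(expected_steps_per_pass(0.95, k), 2))
```

Under these assumptions, a 95% acceptance rate with 8 predicted steps gives about 7.4 steps per full pass, which is in the same ballpark as 7x. Real speedups would be lower once predictor cost and correlated misses are accounted for.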