Hacker News
by
littlestymaar
39 days ago
So it's D-Flash, but applied at each transformer layer and sharing the KV cache of the original model? Very smart!
foobar10000
39 days ago
Kind of, yeah. Predictivity is an open question for larger layers, though, when trying to scale this up. But yeah, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
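For anyone wondering how a 95% predictor could land near 7x, here is a rough back-of-the-envelope sketch. It assumes the standard speculative-decoding model, which the comment above does not spell out: each drafted step is accepted independently with probability `alpha`, `k` steps are drafted per verification pass, and drafting itself is treated as free. The function name and the choice of `k = 8` are purely illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the verifier always contributes one token
    (the correction or the bonus token). Draft cost is ignored.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the verifier's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical setting: 95% acceptance, 8 drafted steps per pass.
    print(f"{expected_tokens_per_pass(0.95, 8):.2f}")  # about 7.4
```

Under these assumptions, 95% acceptance with 8 drafted steps gives roughly 7.4 tokens per full-model pass, which is in the ballpark of the 7x figure. Real speedups would be lower once the predictor's own cost is counted.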