Hacker News
by dot_treo 35 days ago
I would probably treat each (3 GatedDeltaNet + 1 GatedAttention) group as one transformer block. When generating the next steps, one would then use the KV cache for the gated attention layer and skip the delta nets entirely.
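
Roughly what I have in mind, as a minimal sketch rather than anything taken from a real implementation. Only the layer names come from the architecture; `Block`, `build_blocks` and `decode_step_plan` are made-up names, and the cache just stores token ids as a stand-in for real key/value tensors. It only shows the bookkeeping and does not check whether skipping the delta nets actually gives correct outputs.

```python
from dataclasses import dataclass, field

# Layer names are from the comment above; the grouping code is hypothetical.
PATTERN = ["GatedDeltaNet", "GatedDeltaNet", "GatedDeltaNet", "GatedAttention"]


@dataclass
class Block:
    """One (3 GatedDeltaNet + 1 GatedAttention) group, treated as a single block."""
    layers: list
    # Stand-in for the attention layer's KV cache: one entry per decoded position.
    kv_cache: list = field(default_factory=list)


def build_blocks(num_blocks):
    """Create num_blocks groups, each with its own copy of the 3+1 pattern."""
    return [Block(layers=list(PATTERN)) for _ in range(num_blocks)]


def decode_step_plan(block, token_id):
    """Record one decode step and return the layers that would run.

    Under the proposal in the comment, only the gated attention layer runs
    (reading and extending its cache); the delta-net layers are skipped.
    """
    block.kv_cache.append(token_id)  # placeholder for a real (key, value) pair
    return [name for name in block.layers if name == "GatedAttention"]
```

So for a model with two such blocks, `decode_step_plan(build_blocks(2)[0], 7)` lists just the gated attention layer, and each block keeps its own cache.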