Hacker News new | ask | show | jobs
by dot_treo 35 days ago
I would probably treat the (3 GatedDeltaNet + 1 GatedAttention) Blocks as one transformer block, when generating next steps one would therefore use the kv cache for the gated attention and skip the entire delta nets.