| HN Mirror



	by dot_treo 35 days ago
	I would probably treat each (3 GatedDeltaNet + 1 GatedAttention) group as one transformer block. When generating the next tokens, one would therefore use the KV cache for the gated attention and skip the delta nets entirely.
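
A rough sketch of the grouping I have in mind, assuming the layers repeat in a strict 3 + 1 pattern. The names `PATTERN`, `Block`, `group_into_blocks` and `layers_using_kv_cache` are made up for illustration and do not come from any real library. It only shows the bookkeeping (which layers would consult a KV cache); it does not model what the delta nets compute or whether skipping them is valid.

```python
from dataclasses import dataclass, field

# Hypothetical layer-type labels; they follow the comment, not a real API.
DELTA = "GatedDeltaNet"
ATTN = "GatedAttention"
PATTERN = (DELTA, DELTA, DELTA, ATTN)


@dataclass
class Block:
    layers: tuple
    # One KV cache per block, owned by its single GatedAttention layer.
    kv_cache: list = field(default_factory=list)


def group_into_blocks(layer_types):
    """Group a flat layer list into (3 GatedDeltaNet + 1 GatedAttention) blocks."""
    if len(layer_types) % len(PATTERN) != 0:
        raise ValueError("layer count must be a multiple of 4")
    blocks = []
    for i in range(0, len(layer_types), len(PATTERN)):
        chunk = tuple(layer_types[i:i + len(PATTERN)])
        if chunk != PATTERN:
            raise ValueError(f"unexpected layer pattern at index {i}: {chunk}")
        blocks.append(Block(layers=chunk))
    return blocks


def layers_using_kv_cache(blocks):
    """Return the flat indices of the layers that would consult a KV cache."""
    return [
        b * len(PATTERN) + j
        for b, block in enumerate(blocks)
        for j, name in enumerate(block.layers)
        if name == ATTN
    ]


layer_types = list(PATTERN) * 3          # 12 layers, i.e. 3 blocks
blocks = group_into_blocks(layer_types)
print(len(blocks))                       # 3
print(layers_using_kv_cache(blocks))     # [3, 7, 11]
```

So with 12 layers you would end up with 3 blocks and 3 KV caches, one per gated attention layer.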