HN Mirror



	by jacknotold 119 days ago
	Makes sense. Random initial states for generation are interesting because they add diversity at the source. We tried something related with an alpha parameter that scales the magnitude of the learned state, and found the optimal value differs by roughly 10x between architectures: 0.07 for GatedDeltaNet vs 0.65 for Mamba-2. Too large and generation degrades; too small and the state washes out before it affects anything.
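
	To make the "washes out" part concrete, here is a minimal sketch on a toy linear recurrence. This is not GatedDeltaNet or Mamba-2: the scalar `decay`, the update rule `h = decay * h + x`, and the helper names are all assumptions chosen for illustration. It only shows that when the initial state is `alpha * learned_state`, its remaining contribution after T steps shrinks as `alpha * decay**T`.

```python
import numpy as np

def scaled_initial_state(learned_state, alpha):
    # Hypothetical helper: alpha scales the magnitude of the learned state.
    return alpha * learned_state

def run_recurrence(h0, inputs, decay):
    # Toy stand-in for a recurrent layer: h_t = decay * h_{t-1} + x_t.
    # Real architectures use gated, input-dependent updates instead.
    h = h0.copy()
    for x in inputs:
        h = decay * h + x
    return h

def initial_state_contribution(learned_state, alpha, inputs, decay):
    # Norm of the part of the final state that is due to the initial state,
    # measured as the difference against a run that starts from zeros.
    h_with = run_recurrence(scaled_initial_state(learned_state, alpha), inputs, decay)
    h_without = run_recurrence(np.zeros_like(learned_state), inputs, decay)
    return np.linalg.norm(h_with - h_without)

rng = np.random.default_rng(0)
learned_state = rng.normal(size=8)   # placeholder for a trained state
inputs = rng.normal(size=(20, 8))    # 20 steps of placeholder inputs

# The two alpha values are the ones quoted above; decay is made up.
for alpha in (0.07, 0.65):
    c = initial_state_contribution(learned_state, alpha, inputs, decay=0.9)
    print(f"alpha={alpha}: contribution norm after 20 steps = {c:.4f}")
```

	In this toy setup the contribution is linear in alpha, so it says nothing about why the best alpha differs between architectures or where generation starts to degrade; those depend on the actual update rule and have to be measured.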