Hacker News new | ask | show | jobs
by jacknotold 72 days ago
Makes sense. Random initial states for generation is interesting because it adds diversity at the source. We tried something related with the alpha parameter (scales the learned state magnitude) and found the optimal value differs 10x between architectures: 0.07 for GatedDeltaNet vs 0.65 for Mamba-2. Too large and generation degrades, too small and the state washes out before it affects anything.