by jacknotold
72 days ago
Makes sense. Using random initial states for generation is interesting because it adds diversity at the source. We tried something related with the alpha parameter (which scales the learned state magnitude) and found the optimal value differs by roughly 10x between architectures: 0.07 for GatedDeltaNet vs 0.65 for Mamba-2. Too large and generation degrades; too small and the state washes out before it affects anything.
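
To make the "washes out" part concrete, here's a toy sketch. It is not our actual code and not how either architecture updates its state. It assumes a made-up scalar-decay recurrence h_t = decay * h_{t-1} and ignores inputs entirely; `initial_state_contribution`, `decay`, and the example `learned_state` are all hypothetical names and values picked for illustration. It only shows how much of an alpha-scaled initial state is left after a few steps.

```python
def initial_state_contribution(alpha, learned_state, decay, steps):
    """Return what is left of the alpha-scaled initial state after `steps`
    updates of a toy recurrence h_t = decay * h_{t-1} (inputs ignored)."""
    # Scale the learned state by alpha to get the initial state.
    h = [alpha * s for s in learned_state]
    # Apply the assumed constant decay once per step.
    for _ in range(steps):
        h = [decay * v for v in h]
    return h


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in vec) ** 0.5


if __name__ == "__main__":
    learned_state = [1.0, -2.0, 0.5]  # hypothetical learned state
    for alpha in (0.07, 0.65):
        remaining = initial_state_contribution(
            alpha, learned_state, decay=0.9, steps=20
        )
        print(f"alpha={alpha}: remaining norm = {l2_norm(remaining):.4f}")
```

In this toy setup the surviving magnitude is just proportional to alpha, so it says nothing about why the optimum differs between architectures; real models have input-dependent gating, which is presumably where that difference comes from.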