| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jackcook 888 days ago
	Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper: “We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”

1 comments

nlrk 888 days ago

when I read the paper I thought the idea was changing \Delta permits getting the model to learn different things over different time scales. As you quoted “the main source of improvement".

I don’t have an llm backround, just controls, so I might wrong.

link