Hacker News
by robrenaud 890 days ago
Removing the exponential allows some linear-algebra-based tricks. It makes the state space linear. Linearity allows a kind of running sum, where the state at time T is quickly computable from the state at time T-1.
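
To make the running-sum point concrete, here is a rough sketch. It assumes the "exponential" is the exp() inside softmax attention, and that dropping it leaves plain dot-product scores with no normalization; the thread does not spell that out, so treat it as my reading. The function names and the pure-Python list representation are made up for illustration and are not from any library.

```python
# Sketch only: unnormalized causal attention with exp() removed, computed two ways.
# Assumption: scores are plain dot products q.k (no softmax, no scaling).
# All function names here are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_quadratic(qs, ks, vs):
    """Recompute each output from all earlier steps: O(T^2) score evaluations."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            score = dot(q, ks[s])  # no exp() applied to the score
            for j, v in enumerate(vs[s]):
                out[j] += score * v
        outputs.append(out)
    return outputs

def attention_running_sum(qs, ks, vs):
    """Carry a d_k x d_v state: state_T = state_(T-1) + outer(k_T, v_T)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Linear update: the new state depends only on the previous state
        # and the current key/value pair.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the current query: out = q @ state.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

Both functions should agree up to floating-point rounding, because sum_s (q.k_s) v_s equals q times sum_s outer(k_s, v_s). With exp() around the score, that factoring no longer works, so the state cannot be carried forward this way.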

That linear simplification costs the model some expressiveness, which is why these models don't fit the training data as well.

1 comment

Wonder if it'd be possible to have our cake and eat it too by treating layer outputs as log(state space) in that case?