by fchaubard
395 days ago
Layman Abstract: Transformers keep around all previous tokens for each generated token, so they take up ENORMOUS GPU memory and cost during inference. Humans do not: we page information in and out of our small, fixed-size "working memory", keeping only the important parts of the past. RNNs are more like us: they compress all previous tokens into a small, fixed-size memory. However, we can't train them with legacy backpropagation through time (BPTT), because it doesn't scale and suffers from exploding/vanishing gradients. So we found a 1992 zero-order algorithm to replace BPTT, and not only does it scale amazingly well, in some cases it trains 19x faster than BPTT! So maybe, with this, RNNs can replace transformers?
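For readers unfamiliar with the term, here is a rough sketch of what "zero-order" training means in general: the gradient is estimated from forward evaluations of the loss alone, with no backward pass. The abstract does not say which algorithm is used, so this is a generic two-point random-perturbation estimator and not necessarily the authors' method. The toy `loss`, the helper names, and the hyperparameters (`eps`, `lr`, `steps`) are all assumptions made up for illustration.

```python
import random

def loss(params):
    # Hypothetical toy objective standing in for a model's training loss.
    return sum((p - 3.0) ** 2 for p in params)

def zero_order_grad(f, params, eps=1e-3, rng=random):
    # Perturb all parameters at once along a random +/-1 direction.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, delta)]
    minus = [p - eps * d for p, d in zip(params, delta)]
    # Two forward evaluations give the slope along that direction.
    slope = (f(plus) - f(minus)) / (2.0 * eps)
    # Project the slope back onto the direction to get a gradient estimate.
    return [slope * d for d in delta]

def train(steps=500, lr=0.05, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = zero_order_grad(loss, params, rng=rng)
        # Plain gradient-descent update using the estimated gradient.
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

if __name__ == "__main__":
    final = train()
    print("final loss:", loss(final))
```

Each step costs two loss evaluations regardless of how many parameters there are, and nothing has to be stored for a backward pass through time, which is the property that makes this family of methods interesting for recurrent models.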