by imtringued
260 days ago
Deep Equilibrium Models

> We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation.

https://arxiv.org/abs/1909.01377

What's fascinating about deep equilibrium models is that a single layer, applied repeatedly until its output stops changing, stands in for the whole stack of layers in a conventional deep neural network. Recursion is all you need! Because the solver runs until convergence, the model automatically uses more iterations for difficult inputs and fewer iterations for easy ones.
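To make the "single layer iterated to a fixed point" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the paper finds the equilibrium with a root-finder (Broyden's method) and backpropagates with implicit differentiation, whereas this just repeats the layer until the update is tiny. The layer form `tanh(W z + x)`, the weights `W`, and the names `layer` and `solve_equilibrium` are all assumptions made for illustration.

```python
import math

def layer(z, x, W):
    """One weight-tied layer: f(z, x) = tanh(W z + x). Hypothetical form."""
    return [math.tanh(sum(w * zj for w, zj in zip(row, z)) + xi)
            for row, xi in zip(W, x)]

def solve_equilibrium(x, W, tol=1e-8, max_iter=1000):
    """Iterate z <- f(z, x) from z = 0 until the largest update is below tol.

    Returns the approximate fixed point and the number of iterations used.
    """
    z = [0.0] * len(x)
    for n in range(1, max_iter + 1):
        z_next = layer(z, x, W)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, n
    return z, max_iter

# Assumed weights: absolute row sums are below 1, so f is a contraction
# and plain iteration is guaranteed to converge.
W = [[0.3, -0.2], [0.1, 0.4]]
x = [0.5, -1.0]
z_star, steps = solve_equilibrium(x, W)
```

The "depth" here is `steps`, which is not fixed in advance: it depends on the input and on the tolerance, which is the sense in which the amount of computation adapts.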
I read a paper recently on something similar for diffusion, called Fixed Point Diffusion Models. They specialize the first and last layers but recurse the middle layer some number of times until convergence.
Considering that a Transformer is a residual model, each layer must be adding progressively more precise adjustments to the selected token. It makes a lot of sense to think of the layers as the steps of an optimisation method.
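A toy sketch of that reading, under a strong assumption: suppose each residual branch happened to compute a gradient step on some energy, here `E(h) = 0.5 * (h - target)**2`. Real Transformer layers are learned and are not known to take this form; `residual_block`, `run_stack`, and the scalar state are hypothetical and only illustrate the analogy.

```python
def residual_block(h, target, step=0.5):
    """Hypothetical residual branch: a gradient step on E(h) = 0.5 * (h - target)**2."""
    return -step * (h - target)

def run_stack(h0, target, depth):
    """Apply `depth` residual layers and record the state after each one."""
    h = h0
    trace = [h]
    for _ in range(depth):
        h = h + residual_block(h, target)  # residual connection: h_next = h + g(h)
        trace.append(h)
    return trace

trace = run_stack(h0=0.0, target=1.0, depth=6)
# Size of the correction each layer adds to the running state.
corrections = [abs(b - a) for a, b in zip(trace, trace[1:])]
```

In this toy setup the stack is exactly gradient descent, so each layer's correction is smaller than the previous one and the state closes in on `target`.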