Hacker News
by magicalhippo 260 days ago
Makes me think once again about the similarity between Finite Impulse Response[1] filters (traditional LLMs) and Infinite Impulse Response[2] filters (recursive models). Not that it's a very good or original analogy.

Anyway, with FIR you typically need many, many times the coefficients to get similar filter cutoff performance to what a few IIR coefficients can do.

You can convert an IIR to an FIR using for example the window design method[3], where if you use a rectangular window function you essentially unroll the recursion but stop after some finite depth.
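
To make the unrolling concrete, here is a rough sketch in plain Python. The one-pole low-pass filter, the coefficient a = 0.9 and the function names are all my own picks for illustration, not anything from the linked articles. The FIR taps are just the first few samples of the IIR's impulse response, i.e. the recursion unrolled and cut off by a rectangular window.

```python
def iir_one_pole(x, a):
    """Recursive low-pass: y[n] = a*y[n-1] + (1-a)*x[n]."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * prev + (1.0 - a) * sample
        y.append(prev)
    return y

def truncated_fir_taps(a, n_taps):
    """First n_taps samples of the IIR impulse response (rectangular window)."""
    return [(1.0 - a) * a**k for k in range(n_taps)]

def fir(x, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k]."""
    return [sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]

a = 0.9                # assumed pole location, chosen for illustration
x = [1.0] * 200        # unit step input
y_iir = iir_one_pole(x, a)

errors = {}
for n_taps in (8, 32, 128):
    y_fir = fir(x, truncated_fir_taps(a, n_taps))
    errors[n_taps] = max(abs(p - q) for p, q in zip(y_iir, y_fir))
    print(n_taps, errors[n_taps])
```

The IIR needs a single coefficient, while the truncated FIR needs on the order of a hundred taps before its step response gets close, which is the coefficient blow-up mentioned above.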

Similarly, it seems that by unrolling the TRM you end up with the traditional LLM architecture of many repeated attention+ff blocks, minus the global feedback part. And unlike a true IIR, the TRM does implement a finite cut-off, so in that sense it is more like a traditional FIR/LLM than the structure suggests.

So it would perhaps be interesting to compare the TRM network to a similarly unrolled version.
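
A toy sketch of what such a comparison would start from. This is not the TRM: the scalar residual block, the weights and the depth are made up for illustration. It only shows that a weight-tied loop and its unrolled copy compute the same thing at initialization, while the unrolled version carries depth times the parameters and is free to let them diverge during training.

```python
import math

def block(h, w, b):
    """Toy scalar 'layer' with a residual connection: h + tanh(w*h + b)."""
    return h + math.tanh(w * h + b)

def recursive_model(h, params, depth):
    """Apply the same (weight-tied) block `depth` times."""
    w, b = params
    for _ in range(depth):
        h = block(h, w, b)
    return h

def unrolled_model(h, layer_params):
    """Apply a stack of blocks, each with its own parameters."""
    for w, b in layer_params:
        h = block(h, w, b)
    return h

shared = (0.5, -0.1)          # hypothetical weights
depth = 6                     # hypothetical recursion depth
unrolled = [shared] * depth   # identical copies; training could untie them

out_rec = recursive_model(0.3, shared, depth)
out_unr = unrolled_model(0.3, unrolled)
n_params_recursive = len(shared)
n_params_unrolled = sum(len(p) for p in unrolled)
print(out_rec, out_unr, n_params_recursive, n_params_unrolled)
```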

Then again, maybe this is all mad ramblings from a sleep-deprived mind.

[1]: https://en.wikipedia.org/wiki/Finite_impulse_response

[2]: https://en.wikipedia.org/wiki/Infinite_impulse_response

[3]: https://en.wikipedia.org/wiki/Finite_impulse_response#Window...

Deep Equilibrium Models

>We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation.

https://arxiv.org/abs/1909.01377

What's fascinating about deep equilibrium models is that you only need a single layer to be equivalent to a conventional deep neural network with multiple layers. Recursion is all you need! The model automatically uses more iterations for difficult tasks and fewer iterations for easy tasks.
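
A minimal sketch of the idea, with heavy caveats: the paper finds the equilibrium with a root-finding solver, whereas this uses naive fixed-point iteration, and the scalar layer z = tanh(w*z + x), the weight w = 0.9 and the tolerance are my own assumptions. It only illustrates that one weight-tied layer iterated to convergence takes a different number of steps depending on the input.

```python
import math

def solve_fixed_point(x, w=0.9, tol=1e-10, max_iter=1000):
    """Iterate z <- tanh(w*z + x) until successive iterates differ by < tol.

    Returns the approximate equilibrium and the number of iterations used.
    """
    z = 0.0
    for i in range(1, max_iter + 1):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter

# Same layer, two inputs: one settles quickly, the other slowly.
z_easy, iters_easy = solve_fixed_point(2.0)
z_hard, iters_hard = solve_fixed_point(0.05)
print(iters_easy, iters_hard)
```

Here the "hard" input lands where the layer is only weakly contractive, so the iteration needs many more steps to settle.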

Thanks, something like that was going through my mind, nice to get a good reference for it. Any insights on why this is not a more popular approach? Maybe it's too difficult for a single layer to scale.

I read a paper recently on something similar for diffusion, called Fixed Point Diffusion Models. They specialize the first and last layers but recurse the middle layer some number of times until convergence.

Considering how a Transformer is a residual model, each layer must be adding more and more precise adjustments to the selected token. It makes a lot of sense to think of this like the steps of an optimisation method.