by lonk11
488 days ago
Running one layer 4 times should only require fetching that layer's weights once, whereas running 4 distinct layers means fetching 4x the parameters. The recurrent approach is therefore more efficient when memory bandwidth is the bottleneck. They discuss this in the paper.
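A rough back-of-the-envelope sketch of that claim, assuming the shared layer's weights stay resident in fast memory across all repeated passes (an idealization that real hardware may not satisfy for large layers). The function name, the 1B-parameter layer size, and the 2 bytes per parameter are illustrative assumptions, not figures from the paper.

```python
def weight_bytes_fetched(params_per_layer, distinct_layers, bytes_per_param=2):
    """Bytes of weights read from main memory for one forward pass.

    Hypothetical cost model: each distinct layer's weights are fetched
    exactly once and reused from fast memory on any repeated application.
    """
    return params_per_layer * distinct_layers * bytes_per_param

P = 1_000_000_000  # illustrative: 1B parameters per layer

recurrent = weight_bytes_fetched(P, distinct_layers=1)  # one layer, run 4 times
stacked = weight_bytes_fetched(P, distinct_layers=4)    # four different layers

print(stacked / recurrent)  # ratio of weight traffic, stacked vs. recurrent
```

Under this model the compute is the same in both cases (four layer applications); only the weight traffic differs, which is why the gap matters mainly when memory bandwidth is the limit.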
|
I meant it rhetorically, in reference to interpretability. I don't see a real difference between a 100B-parameter model and a (fixed) 4x-recurrent 25B-parameter model as far as understanding what the model is `thinking` for the next-token prediction task.
You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, whether the model is fixed-size and quite deep or recurrent.
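One way to picture this, as a toy sketch only: a fixed-depth recurrent model unrolls into a stack whose layers share one set of weights, so a tool that records per-layer activations sees the same kind of trace in both cases. The `layer` function here is a hypothetical stand-in (scale then ReLU), not a real transformer block.

```python
def layer(x, w):
    # Toy "layer": scale each element by w, then apply ReLU.
    return [max(0.0, w * v) for v in x]

def run_recurrent(x, w, steps):
    """Apply the same layer `steps` times, recording each intermediate state."""
    trace = []
    for _ in range(steps):
        x = layer(x, w)
        trace.append(x)
    return trace

def run_stacked(x, weights):
    """Apply a stack of layers in order, recording each intermediate state."""
    trace = []
    for w in weights:
        x = layer(x, w)
        trace.append(x)
    return trace

x0 = [1.0, -2.0, 3.0]
recurrent_trace = run_recurrent(x0, w=0.5, steps=4)
# A 4-layer stack with tied weights is the unrolled recurrent model.
stacked_trace = run_stacked(x0, weights=[0.5, 0.5, 0.5, 0.5])

print(recurrent_trace == stacked_trace)
```

Either way there are exactly four intermediate states between the input and the output, which is all the room the model has before it must emit the next token.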