by Tostino
489 days ago
Yeah, understood. I'm excited for the reduction in parameter count that will come once major models adopt this. I meant it rhetorically, in reference to interpretability. I don't see a real difference between a 100B-parameter model and a (fixed) 4x recurrent 25B-parameter model as far as understanding what the model is `thinking` during next-token prediction. You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it has to output the next token, whether the model is fixed-size and quite deep or recurrent.
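To make the comparison concrete, here is a toy sketch of what I mean. Everything in it is made up for illustration: `make_block` and `run_with_trace` are hypothetical helpers, and each "block" is a scalar function standing in for a 25B-parameter chunk. The point is only that a fixed 4x recurrence unrolls into the same kind of step-by-step trace as a 4-block deep stack, so the same recording hook works on both.

```python
import math

def make_block(w, b):
    # Hypothetical stand-in for one 25B-parameter block: a scalar affine map plus tanh.
    def block(x):
        return math.tanh(w * x + b)
    return block

def run_with_trace(blocks, x):
    # Apply the blocks in order and record the activation after each step.
    # This plays the role of the interpretability hook.
    trace = []
    for block in blocks:
        x = block(x)
        trace.append(x)
    return x, trace

# Deep model: four distinct blocks (stand-in for 4 x 25B = 100B parameters).
deep = [make_block(w, 0.1) for w in (0.5, 0.8, 1.1, 1.4)]

# Recurrent model: one shared block applied a fixed four times.
shared = make_block(0.9, 0.1)
recurrent = [shared] * 4

deep_out, deep_trace = run_with_trace(deep, 1.0)
rec_out, rec_trace = run_with_trace(recurrent, 1.0)

# Both models expose one recorded activation per step before the output.
print(len(deep_trace), len(rec_trace))
```

The only difference between the two lists is whether the four entries are distinct objects or the same one repeated. The tracing code does not need to know which.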