by HarHarVeryFunny 497 days ago
Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc) for a given inference. Ideally having recurrence internal to the model would allow the model itself to decide how long to iterate for before outputting anything.

While not the main focus, see Section 6.1 and Figure 10 for a simple adaptive exit strategy for inference.
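To make the idea of an adaptive exit concrete, here is a toy sketch. It is not the paper's criterion, just one generic way to do it: keep applying the recurrent block until the latent state stops changing much. `recurrent_step`, `weight`, `bias`, `tol` and `max_iters` are all made-up names and values for illustration, and the "block" is a simple contraction rather than a real transformer layer.

```python
import math


def recurrent_step(state, weight=0.5, bias=1.0):
    """Hypothetical stand-in for one pass of the recurrent block.

    A real block would be a stack of transformer layers; this is just a
    contraction whose fixed point is bias / (1 - weight).
    """
    return [weight * s + bias for s in state]


def iterate_with_adaptive_exit(state, max_iters=64, tol=1e-3):
    """Apply the block until the state moves by less than tol (L2 norm),
    or until max_iters is reached. Returns (final_state, steps_used)."""
    for step in range(1, max_iters + 1):
        new_state = recurrent_step(state)
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_state, state)))
        state = new_state
        if delta < tol:
            return state, step
    return state, max_iters


final_state, steps_used = iterate_with_adaptive_exit([0.0, 4.0])
print(steps_used, final_state)
```

With this kind of rule the iteration count is chosen per input at inference time instead of being a fixed r, though the threshold is still set from outside the model.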

I imagine that they chose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next is the main advance of transformers over LSTMs (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train due to all that redundant work at large r.

> Latent / embedding-space reasoning seems a step in the right direction

Might be good for reasoning, but it's terrible for interpretation / AI-safety.

Why is it any different to do 4 recurrent passes than having a model that is 4x deeper?
Running one layer 4 times should fetch the weights of that layer once. Running 4 distinct layers makes you fetch 4x the parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
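A back-of-the-envelope version of that argument, as a sketch under stated assumptions: the block size (25B parameters), 2 bytes per parameter, and the `weights_stay_resident` flag are all hypothetical choices for illustration, not figures from the paper. The "fetch once" case only holds if the block's weights can stay in fast memory across iterations; if they cannot, the weight traffic matches the deeper model.

```python
def weight_bytes_fetched(params_per_block, distinct_blocks, passes_per_block,
                         bytes_per_param=2, weights_stay_resident=True):
    """Bytes of weights read from main memory for one forward pass.

    If weights stay resident in fast memory across repeated passes, each
    distinct block is fetched once; otherwise it is re-fetched on every pass.
    """
    fetches_per_block = 1 if weights_stay_resident else passes_per_block
    return params_per_block * bytes_per_param * distinct_blocks * fetches_per_block


P = 25_000_000_000  # hypothetical parameters per block

# Four distinct blocks, each run once (the "4x deeper" model).
deep = weight_bytes_fetched(P, distinct_blocks=4, passes_per_block=1)

# One block run four times, weights kept resident (best case).
recurrent_best = weight_bytes_fetched(P, distinct_blocks=1, passes_per_block=4)

# One block run four times, weights re-read on every pass (worst case).
recurrent_worst = weight_bytes_fetched(P, distinct_blocks=1, passes_per_block=4,
                                       weights_stay_resident=False)

print(deep // 10**9, recurrent_best // 10**9, recurrent_worst // 10**9)  # GB
```

Either way the recurrent model stores 4x fewer unique parameters; whether it also moves 4x fewer bytes depends on that residency assumption.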

Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token no matter if the model is just a fixed size and quite deep, or recurrent.

I guess the most interpretable option is to have as shallow a model as possible, but with a longer CoT. It would be quite interesting to see the trade-off between the two. Though, unfortunately, deeper is probably better.
> seems a step in the right direction

I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.

> externally specifying the number of recurrent iterations

Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.

> I can’t see why

It just provides a bigger representation space, and seems more like what we do given that many people don't have an inner dialog, and some think pictorially.

It seems it could allow reasoning over superpositions of concepts, if such things exist internal to the model (but presumably not at the edge where they need to be decodable into specific tokens).

> I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens.

Efficiency. Written language is extremely inefficient. By running through whole concepts at a time instead of parts of a word, the reasoning will be much more concise.

If we're talking about conscious thought, it takes millions of simultaneously firing neurons to form words. If we're talking about unconscious intelligence, it's closer to latent space. A lot of intelligence can't be articulated.
(citation needed) It sounds fun and all, but we have barely established any connection between the human brain and LLMs as they exist today.
We need to reboot Bryan Cantrill's "Don't anthropomorphize the lawn mower" talk with a new edition titled "Don't anthropomorphize the internet document simulator"
Identifying scheming in the latent streams would be harder as you would have an extra layer of obfuscation between you and the model’s reasoning.