| HN Mirror



	by HarHarVeryFunny 497 days ago
	Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc) for a given inference. Ideally having recurrence internal to the model would allow the model itself to decide how long to iterate for before outputting anything.

3 comments

Manabu-eo 496 days ago

While not the main focus, see Section 6.1 and Figure 10 for a simple adaptive exit strategy for inference.

I imagine that they choose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next is the main revolution about transformers vs LSTM (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train due to all that redundant work at large r.
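
To make the idea of an adaptive exit concrete, here is a minimal sketch of one possible convergence-based stopping rule: keep applying the recurrent step until the state stops changing much, up to a cap. This is an illustration only and not necessarily the criterion used in the paper; `iterate_with_early_exit`, `toy_step`, `tol`, and `max_iters` are hypothetical names, and the toy step is a simple contraction standing in for a real recurrent block.

```python
import math

def iterate_with_early_exit(step, state, max_iters=32, tol=1e-3):
    """Apply `step` repeatedly; stop once successive states differ by less than `tol`.

    Returns the final state and the number of iterations actually used.
    Assumption: a small change between steps is a usable proxy for "done thinking".
    """
    for i in range(1, max_iters + 1):
        new_state = step(state)
        # Euclidean distance between successive latent states
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_state, state)))
        state = new_state
        if delta < tol:
            return state, i
    return state, max_iters

def toy_step(state):
    # Stand-in for a recurrent block: a contraction with fixed point 2.0
    return [0.5 * x + 1.0 for x in state]

final_state, iters_used = iterate_with_early_exit(toy_step, [0.0, 0.0])
print(iters_used, final_state)
```

With a rule like this the iteration count is decided per input at inference time, while training can still use a fixed r.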


thomasahle 497 days ago

> Latent / embedding-space reasoning seems a step in the right direction

Might be good for reasoning, but it's terrible for interpretation / AI-safety.


Tostino 496 days ago

Why is it any different to do 4 recurrent passes than having a model that is 4x deeper?


lonk11 496 days ago

Running one layer 4 times should fetch the weights of that layer once. Running 4 layers makes you fetch 4x parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
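
A back-of-the-envelope sketch of that argument, under the loud assumption that a layer's weights stay resident in fast memory across repeated passes (which depends on the hardware and the layer size). The `Stack` class and its fields are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Stack:
    distinct_layers: int      # layers with their own weights
    passes_per_layer: int     # how many times each layer is applied
    params_per_layer: int
    bytes_per_param: int = 2  # assumption: 16-bit weights

    def layer_applications(self) -> int:
        # Compute cost scales with this number in both designs
        return self.distinct_layers * self.passes_per_layer

    def weight_bytes_fetched(self, weights_stay_resident: bool = True) -> int:
        per_layer = self.params_per_layer * self.bytes_per_param
        if weights_stay_resident:
            # Each distinct layer is fetched once, however often it is reused
            return self.distinct_layers * per_layer
        # Worst case: weights are re-fetched on every application
        return self.layer_applications() * per_layer

deep = Stack(distinct_layers=4, passes_per_layer=1, params_per_layer=1_000_000)
recurrent = Stack(distinct_layers=1, passes_per_layer=4, params_per_layer=1_000_000)

print(deep.layer_applications(), recurrent.layer_applications())
print(deep.weight_bytes_fetched(), recurrent.weight_bytes_fetched())
```

Both stacks perform the same number of layer applications, but the recurrent one moves a quarter of the weight bytes when the residency assumption holds. If it does not hold, the bandwidth advantage disappears and only the parameter-count saving remains.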


Tostino 496 days ago

Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token no matter if the model is just a fixed size and quite deep, or recurrent.


thomasahle 496 days ago

I guess the most interpretable option is to have as shallow a model as possible, but with a longer CoT. It would be quite interesting to see the trade-off between the two. Though, unfortunately, deeper is probably better.


janalsncm 497 days ago

> seems a step in the right direction

I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.

> externally specifying the number of recurrent iterations

Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.


HarHarVeryFunny 497 days ago

> I can’t see why

It just provides a bigger representation space, and seems more like what we do given that many people don't have an inner dialog, and some think pictorially.

It seems it could allow reasoning over superpositions of concepts, if such things exist internal to the model (but presumably not at the edge where they need to be decodable into specific tokens).


viraptor 497 days ago

> I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens.

Efficiency. Written language is extremely inefficient. By running through whole concepts at a time instead of parts of a word, the reasoning will be much more concise.


jonathanrmumm 496 days ago

If we're talking about conscious thought, it takes millions of simultaneously firing neurons to form words. If we're talking about unconscious intelligence, it's closer to latent space. A lot of intelligence can't be articulated.


viraptor 496 days ago

(citation needed) It sounds fun and all, but we have barely established any connection between the human brain and LLMs as they exist today.


porridgeraisin 496 days ago

We need to reboot Bryan Cantrill's "Don't anthropomorphize the lawn mower" talk with a new edition titled "Don't anthropomorphize the internet document simulator"


bcantrill 495 days ago

One step ahead of you![0]

[0] https://www.youtube.com/watch?v=bQfJi7rjuEk (slides: https://speakerdeck.com/bcantrill/intelligence-is-not-enough...)


ckrapu 497 days ago

Identifying scheming in the latent streams would be harder as you would have an extra layer of obfuscation between you and the model’s reasoning.
