HN Mirror


by Borealid 5 days ago

The paper presents a constructive transformation from any finite-input (finite vocab, bounded length) transformer to an equivalent Markov chain.
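To make the construction above concrete, here is a minimal sketch (not the paper's code) assuming a toy two-token vocabulary, a fixed context window of two tokens, and a made-up `toy_next_token_probs` function standing in for a transformer forward pass. All names and sizes are hypothetical. Each full context window becomes one Markov state, and emitting a token slides the window by one position.

```python
import itertools
import math

VOCAB = ("a", "b")   # hypothetical toy vocabulary
WINDOW = 2           # hypothetical fixed context length


def toy_next_token_probs(context):
    """Stand-in for a transformer forward pass (assumption, not a real model)."""
    # Softmax over scores that grow with how often each token is in the context.
    scores = [1.0 + context.count(tok) for tok in VOCAB]
    total = sum(math.exp(s) for s in scores)
    return {tok: math.exp(s) / total for tok, s in zip(VOCAB, scores)}


def build_markov_chain(next_token_probs, vocab, window):
    """Enumerate every context window as a state and record its transitions."""
    chain = {}
    for state in itertools.product(vocab, repeat=window):
        probs = next_token_probs(state)
        # Emitting token x drops the oldest token and appends x.
        chain[state] = {state[1:] + (tok,): p for tok, p in probs.items()}
    return chain


chain = build_markov_chain(toy_next_token_probs, VOCAB, WINDOW)
for state, row in chain.items():
    print(state, row)
```

The chain has `len(VOCAB) ** WINDOW` states, so the table grows exponentially with context length. That is the "massive amounts of disk space" cost mentioned in this comment.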

Do you have some concrete example of a transformer that cannot be represented as a mapping from inputs to probability distribution of outputs?

I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.

As a second example proving the point, imagine you sampled a transformer's output for a certain context 85 trillion times, and put the output token frequencies in a table. Repeat for all possible inputs (of which there are a finite number). Then build a literal hash map that looks up the context and spits out the distribution. That certainly is NOT a transformer any more (it's a hash map!!!), but the output approaches indistinguishability as the sample count increases - if the transformer is reasoning, so is the hash map built from it.
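A rough sketch of the sampling idea above, assuming a made-up `toy_model` in place of a real transformer, a three-token vocabulary, and contexts of at most two tokens (all hypothetical). It samples every context, stores the observed frequencies in a dict, and measures how far the table is from the model's true distribution.

```python
import itertools
import random
from collections import Counter

VOCAB = ("a", "b", "c")   # hypothetical toy vocabulary
MAX_LEN = 2               # hypothetical maximum context length


def toy_model(context):
    """Stand-in for a transformer: context -> next-token distribution."""
    weights = [1 + context.count(tok) for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]


def build_lookup_table(model, samples_per_context, seed=0):
    """Sample every possible context and store observed token frequencies."""
    rng = random.Random(seed)
    table = {}
    for length in range(MAX_LEN + 1):
        for context in itertools.product(VOCAB, repeat=length):
            draws = rng.choices(VOCAB, weights=model(context), k=samples_per_context)
            counts = Counter(draws)
            table[context] = [counts[tok] / samples_per_context for tok in VOCAB]
    return table


def max_error(table, model):
    """Largest gap between a stored frequency and the model's true probability."""
    return max(
        abs(p_hat - p)
        for context, estimate in table.items()
        for p_hat, p in zip(estimate, model(context))
    )


small = build_lookup_table(toy_model, 100)
large = build_lookup_table(toy_model, 100_000)
print(max_error(small, toy_model), max_error(large, toy_model))
```

With more samples per context the error should shrink roughly like one over the square root of the sample count, which is the sense in which the table "approaches indistinguishability".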

I'm not talking hot air here, they really are provably equivalent because a 1:1, onto mapping exists.

For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can". Nothing to do with size or time.

1 comments

hackinthebochs 4 days ago

>I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.

There is a classical algorithm for every quantum algorithm if you're willing to waste a massive amount of space and time. There is a finite-state automaton that can recognize any string some Turing machine can recognize. Yet we recognize these as distinct classes of computation. Mathematicians can get away with ignoring the tractability of finding an object with such and such properties. The rest of us can't.

Sure, there is a formal equivalence between LLMs and Markov chains, and this formal equivalence is useful for analysis. But this equivalence is not a constraint on the nature of the computations LLMs are doing. The formal equivalence does not mean that LLMs are "just predicting the next token". A probability distribution is a formal characterization of the statistical relationships between inputs and outputs. But this formalization does not undermine potentially further structure underlying the probability distribution (e.g. a deterministic mapping from inputs to outputs).

>if the transformer is reasoning, so is the hash map built from it.

Definitely not. "Formal" reasoning is making deductions based on the "form" or shape of some statement. In other words, transitioning from some token sequence to another sequence in virtue of the semantic structure of the token sequence (as opposed to its semantic content). Thus a necessary condition for reasoning is the ability to inspect the structure of the input rather than see it as a formless blob. Transformers can plausibly do this; lookup tables, Markov chains, etc. necessarily cannot.

>For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can".

Maybe expressive is the wrong word. But when a model has to wait for someone else to do the work then copy the answer, I call bullshit on it being (computationally) equivalent.


Borealid 4 days ago

Just to make sure I've understood you... Are you arguing that with a set of identically-behaving black boxes, one could be "reasoning" and one could be "not reasoning", and a person would need to look inside the boxes at how they function to decide?

Remember, if the mapping from input to output is identical, there exists no test operating on the machines' output that can differentiate them. You can't tell from "conversing with" a machine whether it is or is not doing what you describe as "inspecting" the input.


hackinthebochs 4 days ago

>Are you arguing that with a set of identically-behaving black boxes, one could be "reasoning" and one could be "not reasoning", and a person would need to look inside the boxes at how they function to decide?

Absolutely! Inside one of the black boxes could be an audio device replaying a tape. The other could be a person thinking and responding. The massive lookup table construct people like to reference is just another kind of recorder: it takes every possible conversation that could happen in some finite sequence of characters and produces the precomputed continuation on demand. No one ever asks where those conversations came from. If God has to imagine them in his mind, conversing with the lookup table is just conversing with God.


Borealid 4 days ago

Okay, understood. You are making a variant of the Chinese Room argument in which you allow some types of computer programs (but not others) to have reason/sentience. I'm not entirely sure what specific lines you're drawing between the programs (what makes a deterministic transformer with sampling temperature zero "not a recording" but a hash table "a recording"?) but that's not super important.

There is nothing wrong with having that philosophy, and I respect it, but personally I think that if it's impossible to tell two things apart using any external observation, there is no meaningful difference between those two things. "Smells like a rose" and all that.
