| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by hackinthebochs 5 days ago
	>The point is that the output is text that is statistically correlated with the input. But we can simply note that this description applies to any machine learning algorithm. Yet LLMs are lightyears better than, say, Markov chains. What people are after is something that elucidates the features of LLMs that allow them to be so productive over what came before.

1 comments

Borealid 4 days ago

There is absolutely nothing stopping someone from distilling a modern LLM into a very effective Markov chain. The physical size of the model would explode because a context window containing C tokens of size B would need B^C Markov prior states, but the actual output would be a deterministic version of the LLM's with top-n n=1 sampling.

In other words, a Markov chain and a Transformer model are exactly equivalent in power (there is NOTHING that can be done with one and not the other). The Transformer model is just better pretrained and a more efficient compression/generation.

link

hackinthebochs 4 days ago

>In other words, a Markov chain and a Transformer model are exactly equivalent in power

Nonsense. Markov chains treat the past context as a single unit, an N-tuple with no internal structure. LLMs leverage the internal structure of the context which allows a large class of generalization that Markov chains necessarily miss.

link

Borealid 4 days ago

No, not nonsense.

Both are a lookup table whose key is the entire context window and whose value is a probability distribution for what the next token should be.

You can say the choice of probability distribution in the value is "leveraging the internal structure of the context" or not, but the same tokens in two different orders are two different lookup keys and saying it's impossible to achieve some result with a Markov chain is factually incorrect.

https://arxiv.org/pdf/2410.02724 describes the equivalence formally.

link

hackinthebochs 4 days ago

That paper doesn't prove the equivalence of Transformers and Markov chains, it uses Markov chains as a theoretical model to understand the behavior of Transforms. The expressivity of the model matters, and Transformers just are more expressive than Markov chains.

>but the same tokens in two different orders are two different lookup keys

This is necessarily true for Markov chains and not necessarily true for Transformers. Transformers learn invariance over certain kinds of semantically irrelevant transformations. The Markov chain simply has to learn each input variant independently, resulting in an explosion of state space and data requirements compared to the functionally equivalent transformer. Expressive power matters.

I really don't get people's love for saying X is "just" Y (it's just a Markov chain, it's just a Kernel method). It's a strange pathology to focus on the superficial similarity while downplaying the boost in expressive power from where the models diverge.

link

Borealid 2 days ago

The paper presents a constructive transformation from any finite-input (finite vocab, bounded length) transformer to an equivalent Markov chain.

Do you have some concrete example of a transformer that cannot be represented as a mapping from inputs to probability distribution of outputs?

I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.

As a second example proving the point, imagine you sampled a transformer's output for a certain context 85 trillion times, and put the output token frequencies in a table. Repeat for all possible inputs (of which there are a finite number). Then you built literally a hash map looking up the context and spitting out the distribution. That certainly is NOT a transformer any more (it's a hash map!!!), but the output approaches indistinguishability as the sample count increases - if the transformer is reasoning, so is the hash map built from it.

I'm not talking hot air here, they really are provably equivalent because a 1:1, onto mapping exists.

For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can". Nothing to do with size or time.

link

hackinthebochs 1 day ago

>I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.

There is a classical algorithm for every quantum algorithm if you're willing to waste a massive amount of space and time. There is a finite-state automata that can recognize any string some Turing machine can recognize. Yet we recognize these as distinct classes of computation. Mathematicians can get away with ignoring the tractability of finding an object with such and such properties. The rest of us can't.

Sure, there is a formal equivalence between LLMs and Markov chains, and this formal equivalence is useful for analysis. But this equivalence is not a constraint on the nature of the computations LLMs are doing. The formal equivalence does not mean that LLMs are "just predicting the next token". A probability distribution is a formal characterization of the statistical relationships between inputs and outputs. But this formalization does not undermine potentially further structure underlying the probability distribution (e.g. a deterministic mapping from inputs to outputs).

>if the transformer is reasoning, so is the hash map built from it.

Definitely not. "Formal" reasoning is making deductions based on the "form" or shape of some statement. In other words, transitioning from some token sequence to another sequence in virtue of the semantic structure of the token sequence (as opposed to its semantic content). Thus a necessary condition for reasoning is the ability to inspect the structure of the input rather than see it as a formless blob. Transformers can plausibly do this; lookup tables, Markov chains, etc necessarily cannot.

>For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can".

Maybe expressive is the wrong word. But when a model has to wait for someone else to do the work then copy the answer, I call bullshit on it being (computationally) equivalent.

link