| HN Mirror


	by chpatrick 184 days ago
	It's not n tokens sometimes and k tokens other times. LLMs have fixed context windows; you just sometimes have less text, so the window isn't full. They're pure functions from a fixed-size block of text to a probability distribution over the next token, same as the classic lookup-table n-gram Markov chain model.

famouswaffles 184 days ago

1. A context limit is not a Markov order. An n-gram model’s defining constraint is: there exists a small constant k such that the next-token distribution depends only on the last k tokens, full stop. You can't use a Markov model trained at order k on anything but k-token contexts, and the tokens in that context relate to each other the same way regardless of content. An LLM’s defining behavior is the opposite: within its window it can condition on any earlier token, and which tokens matter can change drastically with the prompt (attention is content-dependent). “Window size = 8k/128k” is not “order k” in the Markov sense; it’s just a hard truncation boundary.

2. “Fixed-size block” is a padding detail, not a modeling assumption. Yes, implementations batch/pad to a maximum length. But the model is fundamentally conditioned on a variable-length prefix (up to the cap), and it treats position 37 differently from position 3,700 because the computation explicitly uses positional information. That means the conditional distribution is not a simple stationary “transition table” the way the n-gram picture suggests.

3. “Same as a lookup table” is exactly the part that breaks. A classic n-gram Markov model is literally a table (or smoothed table) from discrete contexts to next-token probabilities. A transformer is a learned function that computes a representation of the entire prefix and uses that to produce a distribution. Two contexts that were never seen verbatim in training can still yield sensible outputs because the model generalizes via shared parameters; that is categorically unlike n-gram lookup behavior.
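To make the "literally a table" description in point 3 concrete, here is a minimal sketch of an unsmoothed n-gram model in Python. The function names (`train_ngram`, `next_token_distribution`) and the toy corpus are made up for illustration. The sketch shows only the lookup behavior being described: a context that never appeared verbatim in training has no entry at all.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, k):
    """Count next-token frequencies for every k-token context seen in `tokens`."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - k):
        context = tuple(tokens[i:i + k])
        table[context][tokens[i + k]] += 1
    return table

def next_token_distribution(table, context):
    """Return P(next | context) by table lookup; empty dict if the context was never seen."""
    counts = table.get(tuple(context))
    if not counts:
        return {}
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# Hypothetical toy corpus, order k = 2.
corpus = "the cat sat on the mat and the cat ran".split()
table = train_ngram(corpus, k=2)

seen = next_token_distribution(table, ["the", "cat"])    # context occurs twice in the corpus
unseen = next_token_distribution(table, ["the", "dog"])  # context never occurs: no table entry
```

Real n-gram systems usually add smoothing or backoff so unseen contexts still get a distribution, but that is still bookkeeping over stored counts rather than a learned function with shared parameters.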

I don't know how many times I have to spell this out for you. Calling LLMs Markov chains is less than useless. They don't resemble them in any way unless you understand neither.

chpatrick 184 days ago

I think you're confusing Markov chains and "Markov chain text generators". A Markov chain is a mathematical structure where the probabilities of going to the next state only depend on the current state and not the previous path taken. That's it. It doesn't say anything about whether the probabilities are computed by a transformer or stored in a lookup table, it just exists. How the probabilities are determined in a program doesn't matter mathematically.
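A small Python sketch of the point that the definition does not care how the probabilities are produced. Everything here (`sample_chain`, the two-state weather example, the specific probabilities) is a hypothetical illustration, not anything from the thread. The sampler only ever passes the current state to `transition`, and it behaves identically whether that function reads a table or computes the numbers.

```python
import random

def sample_chain(transition, start, steps, rng):
    """Walk a Markov chain: each step looks only at the current state."""
    state = start
    path = [state]
    for _ in range(steps):
        dist = transition(state)  # depends on `state` only, never on `path`
        states = list(dist)
        weights = [dist[s] for s in states]
        state = rng.choices(states, weights=weights, k=1)[0]
        path.append(state)
    return path

# Transition probabilities stored in a lookup table.
TABLE = {
    "sunny": {"sunny": 0.75, "rainy": 0.25},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def table_transition(state):
    return TABLE[state]

# The same probabilities computed on the fly instead of stored.
def computed_transition(state):
    p_sunny = 0.75 if state == "sunny" else 0.5
    return {"sunny": p_sunny, "rainy": 1.0 - p_sunny}

# Same seed, same chain: the two walks are identical.
path_a = sample_chain(table_transition, "sunny", 10, random.Random(0))
path_b = sample_chain(computed_transition, "sunny", 10, random.Random(0))
```

What the sketch does not settle is the disagreement in this thread: what counts as a legitimate "state" when the system is an LLM.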

saithound 184 days ago

Just a heads-up: this is not the first time somebody has to explain Markov chains to famouswaffles on HN, and I'm pretty sure it won't be the last. Engaging further might not be worth it.

famouswaffles 184 days ago

I did not even remember you and had to dig to find out what you were on about. Just a heads up, if you've had a previous argument and you want to bring that up later then just speak plainly. Why act like "somebody" is anyone but you?

My response to both of you is the same.

LLMs do depend on previous events, but you say they don't because you've redefined state to include previous events. It's a circular argument. In a Markov chain, state is well defined, not something you can attach any property to or redefine as you wish.

It's not my fault neither of you understands what the Markov property is.

chpatrick 184 days ago

By that definition, n-gram Markov chain text generators also include previous state, because you always feed in the last n tokens. :) It's exactly the same situation as LLMs, just with a higher, but still fixed, n.

famouswaffles 183 days ago

We've been through this. The context of an LLM is not fixed. Context windows ≠ n-gram orders.

N-gram generators don't, because n-gram orders are too small and rigid to include the history in the general case.

I think srean's comment up the thread is spot on. This current situation, where the state can be anything you want it to be, just does not make for a productive conversation.

famouswaffles 184 days ago

'A Markov chain is a mathematical structure where the probabilities of going to the next state only depend on the current state and not the previous path taken.'

My point, which seems so hard to grasp for whatever reason, is that in a Markov chain, state is a well-defined thing. It's not a variable you can assign any property to.

LLMs do depend on the previous path taken. That's the entire reason they're so useful! And the only reason you say they don't is because you've redefined 'state' to include that previous path! It's nonsense. Can you not see the circular argument?

The state is required to be a fixed, well-defined element of a structured state space. Redefining the state as an arbitrarily large, continuously valued encoding of the entire history is a redefinition that trivializes the Markov property, which a Markov chain should satisfy. Under your definition, any sequential system can be called Markov, which means the term no longer distinguishes anything.
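For reference, the formal statements both sides are appealing to can be written as follows. The symbols ($X_t$ for the state, $w_t$ for tokens, $V$ for the vocabulary, $S_t$ for the stacked state) are notation chosen here, not taken from the thread. This is the standard textbook construction, given as a sketch.

```latex
% Markov property for a process X_0, X_1, ...
P(X_{t+1} = x \mid X_t, X_{t-1}, \dots, X_0) = P(X_{t+1} = x \mid X_t)

% An order-k model over tokens w_t from a vocabulary V:
P(w_{t+1} \mid w_t, \dots, w_0) = P(w_{t+1} \mid w_t, \dots, w_{t-k+1})

% Stacking the last k tokens into one state gives a first-order chain
% on the finite state space V^k, which has |V|^k elements:
S_t = (w_{t-k+1}, \dots, w_t) \in V^k,
\qquad
P(S_{t+1} \mid S_t, \dots, S_0) = P(S_{t+1} \mid S_t)
```

The dispute in this thread is whether applying the last construction with $k$ equal to an LLM's context window is a meaningful use of the term or one that trivializes it. The formulas alone do not decide that.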

chpatrick 182 days ago

They only have the previous path inasmuch as n-gram Markov text generators have the previous path.

lelanthran 182 days ago

> An n-gram model’s defining constraint is: there exists a small constant k such that the next-token distribution depends only on the last k tokens, full stop.

I don't necessarily agree with GP, but I also don't think that the definitions of a Markov chain or a Markov generator include the word "small".

That constant can be as large as you need it to be.
