| HN Mirror


	by Mike_12345 1190 days ago
	GPT-4 is often overhyped and underhyped because few really understand it. It's not a Markov Chain or a fancy text predictor. It's a ~200 layer neural network that models a vast hierarchy of concepts through language. It has emergent properties that we don't yet understand.

2 comments

letitgo12345 1190 days ago

Where are you getting the 200 number from?

Mike_12345 1190 days ago

I must have hallucinated that. GPT-3 has 96 layers but they haven't disclosed the number of layers in GPT-4.

apidercondo 1189 days ago

Interesting how we are already starting to use the lingo in the rest of our lives.

golol 1189 days ago

It is a Markov chain; at least the underlying decoder-only transformer is.

Mike_12345 1189 days ago

GPT-4 disagrees:

GPT-3.5, like its predecessor GPT-3, is not a Markov chain. GPT-3.5 is based on the GPT (Generative Pre-trained Transformer) architecture, which is a type of neural network known as a Transformer. Transformers use self-attention mechanisms to process and generate text, allowing them to capture long-range dependencies and context in the input data.

On the other hand, a Markov chain is a stochastic model that describes a sequence of possible events, where the probability of each event depends only on the state attained in the previous event. While Markov chains can be used for simple text generation, they lack the ability to capture the complex relationships and long-range dependencies that GPT-3.5 can handle.

golol 1189 days ago

It's wrong. A decoder-only transformer performs a (possibly random) operation on a state from the state space {tokens}^CtxWindow, where the distribution of the new state depends entirely on the previous state. It is a Markov chain with a special structure: the new state is deterministically equal to the old state shifted by one, with only the last token being newly generated.
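
A minimal sketch of the transition described above, assuming a fixed-length window. The names `step` and `toy_model` are hypothetical, and `toy_model` is only a stand-in for a real transformer: it shows that the next-token distribution may depend on the whole window while the next state still depends only on the current state.

```python
import random
from typing import Callable, Dict, Tuple

# One Markov state is the entire context window, not a single token.
State = Tuple[str, ...]


def step(
    state: State,
    next_token_dist: Callable[[State], Dict[str, float]],
    rng: random.Random,
) -> State:
    """One transition: sample a token given the full window, then shift by one."""
    dist = next_token_dist(state)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    new_token = rng.choices(tokens, weights=weights, k=1)[0]
    # Deterministic part: drop the oldest token, append the new one.
    return state[1:] + (new_token,)


def toy_model(state: State) -> Dict[str, float]:
    """Hypothetical stand-in for a transformer; looks at every token in the window."""
    if state.count("a") > state.count("b"):
        return {"a": 0.2, "b": 0.8}
    return {"a": 0.8, "b": 0.2}


state: State = ("a", "b", "a", "a")
new_state = step(state, toy_model, random.Random(0))
```

Only the last position of `new_state` is random; every other position is copied from the old state, which is the "special structure" the comment refers to.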

Mike_12345 1188 days ago

Then by that reasoning everything in the physical world is a Markov chain, right? That is like saying that any deterministic process in time is a Markov chain.

A tennis ball in flight is a Markov chain since the state at t is a function of the state at t-1.

You have missed the point about the Attention Mechanism in GPT. That is not a Markov chain by definition.

golol 1188 days ago

>Then by that reasoning everything in the physical world is a Markov chain, right?

Well, I guess maybe it's true that you can turn any stochastic process into a Markov chain by changing the state space somehow (for example, the states could be sample trajectories up to some finite time T). And while this is true, it may not be very insightful.

But I personally think that to understand LLMs it is much better to think of the whole context window as a state rather than the individual tokens. If you modelled a simple register-instruction computer as a stochastic process, would you take the states to be (address of last symbol written, last symbol written)? It makes much more sense to take the whole memory as a state. Similarly, a transformer operates on its memory, the context window, so that should be seen as the state. This makes it clear that seeing it as just a stochastic parrot is misleading, as it's all about conditioning the distribution of the next token by prompt engineering the previous tokens. And it is nevertheless a Markov chain with this state space.

Mike_12345 1187 days ago

So basically you're saying it's just an algorithm running on a computer? Yes I agree with that.

killerstorm 1188 days ago

"Markov chain" might mean:

* a kind of stochastic model
* a "naive" realization of that model which directly counts frequencies of N-dimensional vectors

This naive implementation is sometimes used for language modeling, e.g. for the purpose of compression. So people might think you mean that particular implementation rather than a theoretical model.

This sort of a description can be unhelpful.
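
For contrast, the "naive" frequency-counting realization can be sketched as follows. This is an illustrative n-gram counter under assumed names (`train_ngram`, `sample_next`), not any particular compression or language-modeling library.

```python
import random
from collections import Counter, defaultdict


def train_ngram(tokens, n=2):
    """Count how often each token follows each length-n context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        counts[context][tokens[i + n]] += 1
    return counts


def sample_next(counts, context, rng):
    """Sample a follower in proportion to its count; None for an unseen context."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    tokens = list(followers)
    weights = [followers[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


corpus = "the cat sat on the cat mat".split()
table = train_ngram(corpus, n=2)
next_word = sample_next(table, ("on", "the"), random.Random(0))
```

The table stores one row per observed context, so it cannot say anything about a context it has never seen, which is the practical gap between this realization and the abstract stochastic model.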

nl 1189 days ago

It's not. It can do in-context learning, which Markov chains cannot do.

golol 1189 days ago

It is a Markov Chain on the state space {Tokens}^CtxWindow.

nl 1188 days ago

I don't think that's clear at all.

https://arxiv.org/abs/2212.10559 shows an LLM is doing gradient descent on the context window at inference time.

If it's learning relationships between concepts at runtime based on information in the context window then it seems about as useful to say it is a Markov chain as it is to say that a human is a Markov chain. Perhaps we are, but the "current state" is unmeasurably complex.

golol 1188 days ago

Well, all the information it learns at runtime is encoded in the context window. I don't feel like {tokens}^ctxWindow is unmeasurably complex. I think one should see a transformer as a stochastic computer operating on its memory. If you modelled a computer as a stochastic process, would you take the state space to consist of the most recent instruction, or instead the whole memory of the computer?

nl 1186 days ago

GPT-4 has a token window of 32K tokens. I don't think GPT-4's vocabulary size has been released but GPT-3 is 175K. I guess yes, the complexity is technically measurable but it does seem pretty large!
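
A rough back-of-the-envelope for how large {tokens}^CtxWindow is, counting only full windows. The vocabulary sizes are inputs for illustration, not established facts about GPT-4: 175,000 is the figure quoted in the comment above, and 50,257 is the BPE vocabulary size published for GPT-2 and GPT-3. The helper name `log10_state_count` is hypothetical.

```python
import math


def log10_state_count(vocab_size: int, context_length: int) -> float:
    """log10 of vocab_size ** context_length, i.e. roughly the digit count of the state space size."""
    return context_length * math.log10(vocab_size)


CONTEXT = 32 * 1024  # 32K-token window, as stated in the comment

digits_quoted = log10_state_count(175_000, CONTEXT)  # vocabulary figure from the comment
digits_bpe = log10_state_count(50_257, CONTEXT)      # published GPT-2/GPT-3 BPE vocabulary
```

Under either assumption the number of states has well over 150,000 decimal digits, so the state space is finite but far too large to enumerate.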
