Hacker News new | ask | show | jobs
by golol 757 days ago
Ok you want the general answer? Consider a discrete time Markov process with memory length N on a finite state space. Train a transformer with context length N on sample trajectories with SGD. Can you expect the transformer to become a good approximation for the dynamics of the Markov process? More specifically, suppose your Markov process is generated by some algorithm/Turing machine couple with some random data. Then, can you expect the transformer to learn to emulate the behavior of the underlying Turing machine, even when run on data which was notnin the initial distribution?

Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?

This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.

Hence any instance where transformers fail to learn a Marko process are very interesting. Example: Addition of random numbers.

1 comments

Is addition a Markov process? I really don't think so. You can certainly model e.g. integer addition by a Markov process, up to some integer k but addition itself is usually formalised by the Peano axioms, that are not quite Markovian. I guess you can see the relation between S(n) and S(S(n)) as some kind of Markov chain. That's really not a standard view though.

In any case, a complete theory of addition must be correct up to inifinity so you won't get that with any Markov process we can train from data. Although you can learn addition with a simple linear regression, by setting the weights appropriately. That's because a function of a line already includes addition, and multiplication, and that's basically not very different to what the team in the paper above is trying to do. Meaning: they're trying to hand-code the concept of addition in embeddings. It's not 100% because they're also at the same time trying to not 100% encode it, but it's a hard balance to strike.