| What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine. And not only addition: all four arithmetic operations. The technique proposed in the article -imposing a strong inductive bias for addition- kiind of works for multiplication, but not for subtraction or division (clearly; I can't even find the words in the paper). As a practical way to build a machine to do arithmetic this is out of the question. We've known how to mechanise arithmetic since the 1850's with Blaize Pascal and his Pascaline. What is the point in demonstrating it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns in a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever). So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their heads? Who cares? |
Another way to phrase it: Given a physical process that generates discrete time series trajectories, can our current transformer + SGD method learn to emulate the underlying physical processes by observing sample trajectories?
This question can be somewhat mathematically stated but it is quite difficult because there are still some words in there where I used common sense. For example mathematically there will always exist weird counterexamples, so you would have to quantify things very carefully. That's very difficult, so experiments are the best we can do right now.
Hence any instance where transformers fail to learn a Marko process are very interesting. Example: Addition of random numbers.