by shmageggy 1596 days ago
The "algorithm" isn't too mysterious, especially in light of your observation that it does better at the beginning and end digits. It's just doing what transformers do: predicting the probability of a token given the tokens it can attend to. Assume 20B parameters are enough to memorize an addition table. Then the first digit or two are relatively predictable, as are the last, and so is the length, i.e. the point at which a space token becomes probable. The middle tokens are less predictable. This is consistent with the result.

Furthermore, it doesn't even really need to memorize the addition table in the explicit way this suggests. Think about the probability of certain digit tokens appearing given the presence of numbers and plus signs in its data. Thus a behavior consistent with having memorized an addition table emerges from mimicking its training data.
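
To make the "ends are predictable, middle is not" point concrete, here is a rough sketch. It is not a transformer and says nothing about what the model actually does internally; it only measures how much of each output digit can be guessed from nearby operand digits alone. The function name `digit_accuracy`, the 6-digit operands, and the three guessing rules are all assumptions made up for illustration.

```python
import random


def digit_accuracy(n_digits=6, trials=5000, seed=0):
    """Estimate how often purely local guesses get digits of a + b right.

    Hypothetical toy setup: both operands have exactly n_digits digits.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    mid = n_digits // 2  # digit position counted from the right
    first_ok = middle_ok = last_ok = 0

    for _ in range(trials):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        total = str(a + b)

        # Last digit: fully determined by the operands' last digits.
        last_ok += (a % 10 + b % 10) % 10 == int(total[-1])

        # Middle digit: guessed from the two aligned digits only,
        # ignoring any carry coming in from the right.
        da = (a // 10 ** mid) % 10
        db = (b // 10 ** mid) % 10
        middle_ok += (da + db) % 10 == int(total[-1 - mid])

        # First digit: guessed from the two-digit prefixes of each operand.
        shift = 10 ** (n_digits - 2)
        prefix_sum = a // shift + b // shift
        first_ok += str(prefix_sum)[0] == total[0]

    return {
        "first": first_ok / trials,
        "middle": middle_ok / trials,
        "last": last_ok / trials,
    }


if __name__ == "__main__":
    print(digit_accuracy())
```

Under these assumptions the last digit is always right, the first digit is right most of the time, and the middle digit is right only about half the time, because it depends on a carry chain that the local guess cannot see.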

1 comment

It's a little bit more complex here because tokens are variable-length. So getting the order of magnitude (i.e. number of digits) correct requires that it be able to pick tokens for the beginning and end that have the right start/end digit, and then figure out how to make the middle the right length.
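
A small sketch of why variable-length tokens complicate this. The vocabulary below is invented, and greedy longest-match is a stand-in rather than the model's real BPE tokenizer, so treat `HYPOTHETICAL_VOCAB` and `greedy_tokenize` as assumptions for illustration only.

```python
# Invented vocabulary: single digits plus a few multi-digit chunks.
HYPOTHETICAL_VOCAB = {str(d) for d in range(10)} | {"12", "34", "100", "2020"}


def greedy_tokenize(text, vocab=HYPOTHETICAL_VOCAB):
    """Split a digit string by always taking the longest matching token."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


if __name__ == "__main__":
    for number in ("20201234", "789", "1001"):
        print(number, greedy_tokenize(number))
```

With this made-up vocabulary, "20201234" (8 digits) and "789" (3 digits) both come out as three tokens, so the number of tokens emitted does not fix the number of digits. The model has to pick tokens whose combined digit count lands on the right order of magnitude.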

And sure, of course it emerged from mimicking (or more precisely, learning to predict the most likely next token in) its training data – that's how it was trained, it can't have emerged from anything else :) But that doesn't tell us what higher-level algorithm the weights of the network represent. I'm talking about something like this analysis of a curve-detection algorithm learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/