
It's a little bit more complex here because tokens are variable-length. So getting the order of magnitude (i.e. the number of digits) correct requires that the model be able to pick tokens for the beginning and end that have the right leading and trailing digits, and then figure out how to make the middle the right length.
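To make the variable-length point concrete, here's a rough sketch. It assumes a made-up tokenizer that greedily splits a digit string into chunks of up to three digits; `tokenize_digits`, `digit_count`, and the 3-digit limit are hypothetical and not how any particular model's tokenizer actually works. The only point is that the number of tokens doesn't pin down the number of digits.

```python
def tokenize_digits(s, max_len=3):
    """Greedy left-to-right split of a digit string into chunks of up to max_len digits.

    Hypothetical toy tokenizer for illustration only; real tokenizers
    use a learned vocabulary and may split numbers differently.
    """
    tokens = []
    i = 0
    while i < len(s):
        tokens.append(s[i:i + max_len])
        i += max_len
    return tokens


def digit_count(tokens):
    """Total number of digits represented by a token sequence."""
    return sum(len(t) for t in tokens)


a = tokenize_digits("1234567")    # 7 digits
b = tokenize_digits("123456789")  # 9 digits

# Same number of tokens, different order of magnitude.
print(a, len(a), digit_count(a))
print(b, len(b), digit_count(b))
```

Both numbers come out as three tokens here, yet they differ by two orders of magnitude, so counting tokens isn't enough: the length of each chosen token matters too.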

And sure, of course it emerged from mimicking (or more precisely, learning to predict the most likely next token in) its training data – that's how it was trained, it can't have emerged from anything else :) But that doesn't tell us what higher-level algorithm the weights of the network represent. I'm talking about work like this, which tries to understand the curve-detection algorithm learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/