Should math about a vaguely related topic convince me of this? Multilayer ANNs behave differently from one-layer ANNs. Transformers simply don't have anything to do with the model of approximating functions by assembling piecewise functions. This is akin to arguing that computers can't copy files because the disjunctive normal form sometimes needs exponentially many terms in the number of input bits, so obviously copying cannot scale to large data sets. Yes, that is true of the DNF, but copying files on a computer simply does not use boolean operations in a way that would run into that limitation.
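To make the DNF analogy concrete, here is a rough sketch. It counts the AND-terms in the canonical DNF of n-bit parity (a standard example of a function whose DNF blows up) and sets that next to a plain bit copy. The function names are made up for illustration, and parity is my choice of example, not something the argument depends on.

```python
from itertools import product

def parity(bits):
    """Return 1 if an odd number of input bits are set, else 0."""
    return sum(bits) % 2

def canonical_dnf_terms(f, n):
    """Count the AND-terms (minterms) in the canonical DNF of f on n bits.

    The canonical DNF has one term per input assignment where f is 1.
    """
    return sum(1 for bits in product((0, 1), repeat=n) if f(bits))

def copy_bits(bits):
    """Copying touches each bit once: linear work, no DNF involved."""
    return list(bits)

for n in (2, 4, 8, 12):
    # Parity is 1 on exactly half of all 2**n inputs, so this is 2**(n-1).
    print(n, canonical_dnf_terms(parity, n))
```

The term count doubles with every added bit, while `copy_bits` stays linear in the input size. The exponential bound is a property of that one representation, not of the task of moving bits around.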

The way that Transformers learn has more to do with their multilayering than with the transformation across any one layer. Universal approximation only describes what the network learns across any pair of adjacent layers, but the input and output features that it learns in the middle are only tangentially related to the training samples. You cannot predict the capabilities of a deep neural network by considering the limitations of a one-layer learner.
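A toy illustration of the depth point, assuming a tiny ReLU network rather than a Transformer: a "tent" layer built from two ReLU units, composed k times, yields a sawtooth with 2^k linear pieces. A single hidden layer of m ReLU units on a scalar input can produce at most m + 1 pieces, so matching depth k this way would take roughly 2^k units. The names `tent`, `deep`, and `count_linear_pieces` are hypothetical, and this is only a sketch of the standard sawtooth construction.

```python
def relu(x):
    return x if x > 0 else 0.0

def tent(x):
    """One 'layer' with two ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep(x, k):
    """Compose the tent layer k times (a depth-k network, 2 units per layer)."""
    for _ in range(k):
        x = tent(x)
    return x

def count_linear_pieces(k, grid_bits=12):
    """Count linear pieces of deep(., k) on [0, 1] via slope sign changes.

    Uses a dyadic grid so every breakpoint (a multiple of 1/2**k) lands
    exactly on a grid point; valid for k < grid_bits.
    """
    n = 1 << grid_bits
    ys = [deep(i / n, k) for i in range(n + 1)]
    diffs = [ys[i + 1] - ys[i] for i in range(n)]
    changes = sum(1 for a, b in zip(diffs, diffs[1:]) if (a > 0) != (b > 0))
    return changes + 1

for k in (1, 2, 3, 5, 8):
    print(k, count_linear_pieces(k))
```

The piece count grows exponentially in depth while the parameter count grows linearly, which is the kind of behavior a one-layer analysis cannot see.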