Fair pushback, but I do think the LSTM vs Transformers point kinda supports my position in the limit rather than refuting it. Once the compute bottleneck is removed, LSTMs scale favourably.
https://arxiv.org/pdf/2510.02228 (I believe there's similar work done on vanilla LSTMs, but I'd have to go digging)
So the bottleneck was compute, which is compatible with 'data or compute'. But to accept your point: at the time, the algorithmic advances were useful and did remove the bottleneck.
A wider point is that eventually (once compute and data are scaled enough) the algorithms are all learning the same representations: https://arxiv.org/pdf/2405.07987
Algorithms do matter because compute is not unlimited in practice. Otherwise we might as well use bogo sort because the result is eventually the same. Yes the platonic ideal of a sorted list looks the same but that doesn’t tell you anything about how to get there or whether you can in this lifetime.
I bring up transformers because scaling compute and data was unlocked by a better algorithm. It matters a lot because scaling compute isn’t always an option.
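To make the bogo sort point concrete, here's a rough sketch (the function names, the input list and the fixed seed are just mine for illustration). Both routes end at the same sorted list, but bogo sort needs on the order of n! shuffles on average, so it stops being an option well before the input gets interesting.

```python
import random


def is_sorted(xs):
    # True when every element is <= its successor.
    return all(a <= b for a, b in zip(xs, xs[1:]))


def bogo_sort(xs, rng):
    # Shuffle a copy until it happens to be sorted; count the shuffles.
    xs = list(xs)
    shuffles = 0
    while not is_sorted(xs):
        rng.shuffle(xs)
        shuffles += 1
    return xs, shuffles


data = [5, 2, 4, 1, 3, 6]  # illustrative input, kept tiny on purpose
result, shuffles = bogo_sort(data, random.Random(0))  # seed is arbitrary

# Same destination as the built-in sort, very different cost to get there.
print(result == sorted(data), f"{shuffles} shuffles for {len(data)} items")
```

Bump the list to 12 or 13 items and you'd expect billions of shuffles, while `sorted` still finishes instantly.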