by shawntan
1121 days ago
But "don't try to codify 'insight' into the process" seems to suggest "don't try different approaches". I'm not sure how people can trot out the "Bitter Lesson" and interpret it as it is written, yet still say "We're not saying not to think about new approaches". Is the idea then to work only on methods that allow faster computation over more data? FWIW, the Transformer runs faster under current methods of parallelisation, allowing for dramatic scaling that RNNs will find hard to compete with. But we do pay for that in terms of what can be computed (https://arxiv.org/pdf/2207.00729.pdf - TL;DR: Transformers are limited in the types of programs/functions they can compute because of parallelism). Scaling, ironically, does seem to be the 'direction of steepest descent' in terms of what will bring the best performance (for now). Gradient descent does find pleasant local optima that may keep us happy for a while.
CNNs and Transformers are very different. Both can be used for computer vision. The bitter lesson wouldn't stop you from switching from one to the other.