by countWSS 205 days ago
Beauty, symmetry, etc. are largely irrelevant. The key point is that it does not scale: burning gigawatts to compute these matrices (even with all those tricks) will not compete with more efficient, direct methods in the long term. Perhaps transformers are a very elaborate sunk-cost fallacy, where pivoting to a scalable, simpler architecture is treated as "too risky" even when the cost of a new GPU cluster dwarfs whatever it takes to bring an architecture from 0 to ChatGPT level.
1 comment

The whole issue with this industry is that it moves so fast that there is no "long term." You're either all in, in a likely futile attempt to capture the market, or you're not in at all. So you also don't have time to really innovate at the hardware or software level; you need to put everything into training data and training hardware.