by countWSS
205 days ago
Beauty, symmetry, etc. are largely irrelevant.
The key point is that it does not scale: burning
gigawatts to compute these matrices (even with all those tricks)
will not compete with more efficient, direct methods
in the long term. Perhaps transformers are
a very elaborate sunk-cost fallacy, where pivoting to
a scalable, simpler architecture is treated as "too risky"
even when the cost of a new GPU cluster dwarfs whatever it
takes to bring an architecture from 0 to ChatGPT level.