by magicalhippo 613 days ago
> Surely there's a trade-off...

For one, speed and memory. They have twice as many Q and K weights in the attention blocks, leading to a ~10% reduction in throughput on their H100 (Table 7 in Appendix A).

2 comments

They mention similar performance to a vanilla transformer with a significantly reduced param count, though.
I mean, it doesn't necessarily need 2x the Q and K weights to match the accuracy of a regular transformer, right?