


	by magicalhippo 660 days ago
	> Surely there's a trade-off...

	For one, speed and memory. They have twice as many Q and K weights in the attention blocks, leading to a ~10% reduction in throughput on their H100 (Table 7 in Appendix A).

2 comments

They mention similar performance to a vanilla transformer with a significantly reduced parameter count, though.

I mean, it doesn't necessarily need 2x QK to match the accuracy of a regular transformer, right?