

by aDyslecticCrow 660 days ago

In Big-O notation, O(2n) = O(n). Two times slower is actually not that much. If this slowdown results in better inference in the same number of training rounds or better-tuned weights with fewer redundant features, that can be a very worthwhile sacrifice.

It's also a complex optimization problem, not just a matter of compute. Twice the parameters take more than twice the time to tune and twice the working memory to train and use. There are also plenty of model training scenarios where data throughput from the dataset into memory and back out is the final bottleneck.

So, though I agree it is indeed a downside, I think it's a worthwhile sacrifice if the results they show are reproducible.

2 comments

godeldirac 658 days ago

Glad to see your ideas here. Could you clarify a point for me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model will double the W matrix of a standard attention model, which is d_model x d? E.g., suppose llama3 has W of 8192 x 1024; does the diffattn model of the same architecture have W of 8192 x (1024 x 2)?
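To make the arithmetic in that question concrete, here is a minimal sketch that just counts the entries of one projection matrix. The 8192 and 1024 figures are the hypothetical numbers from the question, not confirmed model dimensions, and `projection_params` is a made-up helper. It says nothing about whether a real model compensates elsewhere (for example with fewer heads).

```python
def projection_params(d_model: int, d: int, doubled: bool = False) -> int:
    """Count the weights in one d_model x d (or d_model x 2d) projection matrix."""
    cols = 2 * d if doubled else d
    return d_model * cols


# Hypothetical dimensions taken from the question above.
standard = projection_params(8192, 1024)                    # d_model x d
differential = projection_params(8192, 1024, doubled=True)  # d_model x 2d

print(standard, differential, differential / standard)
```

Under those assumptions the d_model x 2d matrix has exactly twice the entries of the d_model x d one, which is the doubling the question is asking about.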


cuteboy19 660 days ago

The O for any transformer is always quadratic
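A rough sketch of what "quadratic" means here, assuming the cost being counted is the number of entries in the attention score matrix for a sequence of length n. `attention_score_entries` and `num_maps` are hypothetical names for illustration; `num_maps=2` stands in for computing two attention maps instead of one.

```python
def attention_score_entries(seq_len: int, num_maps: int = 1) -> int:
    """Entries in num_maps attention score matrices, each seq_len x seq_len."""
    return num_maps * seq_len * seq_len


for n in (1024, 2048, 4096):
    one_map = attention_score_entries(n)
    two_maps = attention_score_entries(n, num_maps=2)
    # Doubling n quadruples the count; a second map only doubles it.
    print(n, one_map, two_maps, two_maps / one_map)
```

In this toy count, doubling the sequence length multiplies the work by four, while the second map adds a constant factor of two at every length, so the growth rate stays quadratic either way.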
