Hacker News
by aDyslecticCrow 615 days ago
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adapted easily by others. Bravo!

I'm a little concerned about the last sentence of the section introduction of "2 Differential Transformer". It mentions using improvements from previous papers, but in the grammatical context, it's unclear if this improvement is added to both the normal transformer and their diff transformer. This would otherwise sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me.

Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful about some published research in this field.

2 comments

Yes. This looks really, really good to me. Across-the-board improvements in training time, and perplexity improvements both per token trained and per model size. I'm reminded of MoE architectures: in that world we're choosing an optimal small model to process part or all of the inference job. I wonder if MoE got some of the same benefits from forcing the Transformer to distinguish between alternate possibilities.

In any event, I'd imagine that this will get widely adopted if the numbers hold up; like I said, this seems to have basically no downside, and it should be easy to replicate.

There is a downside: every attention layer effectively has to compute attention twice (two runs of scaled_dot_product_attention). As scaled_dot_product_attention is usually one of the most expensive operations in training and inference, networks using this may be substantially slower, and perhaps should be compared against larger networks with more attention layers.

https://github.com/microsoft/unilm/blob/master/Diff-Transfor...
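
For readers who want to see where the second attention pass comes from, here is a minimal sketch in plain Python of the differential attention formula, (softmax(Q1 K1^T / sqrt(d)) - λ softmax(Q2 K2^T / sqrt(d))) V. It is a simplification and not the code from the linked repository: single head, no causal mask, no per-head normalization, and λ passed in as a fixed number rather than learned. The function names are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)); Q and K are lists of row vectors."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]


def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Simplified differential attention: (A1 - lam * A2) @ V."""
    A1 = attention_map(Q1, K1)  # first softmax map
    A2 = attention_map(Q2, K2)  # second softmax map: the extra cost
    out = []
    for a1, a2 in zip(A1, A2):
        # Subtract the second map from the first, then weight the values.
        w = [x - lam * y for x, y in zip(a1, a2)]
        out.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return out
```

Setting lam to 0 recovers ordinary softmax attention, and the A2 line is the extra work the comment above is pointing at.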

Interesting. This is one of those areas where edge inference needs might differ from data center needs: getting an 11b-quality model in 7b at the cost of 30% more inference time is probably a clear yes for anyone doing local inference. And let’s remember that memory bandwidth is a huge factor as well; a 30% smaller model means roughly 30% less time spent on memory-bound work. Anyway, I’m interested in trying this out.
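
The memory-bandwidth arithmetic can be sketched as follows. This assumes decoding is purely bandwidth-bound (every weight is read once per generated token) and that weights are stored in 2 bytes each; the 100 GB/s bandwidth is a hypothetical figure, not a measurement, and the function names are illustrative.

```python
def weight_bytes(params_billion, bytes_per_param=2):
    """Bytes of weights streamed per generated token (2-byte weights assumed)."""
    return params_billion * 1e9 * bytes_per_param


def decode_seconds_per_token(params_billion, bandwidth_gb_s):
    """Lower bound on per-token time if decoding is purely bandwidth-bound."""
    return weight_bytes(params_billion) / (bandwidth_gb_s * 1e9)


BANDWIDTH_GB_S = 100  # hypothetical device bandwidth, not a measured figure

t_small = decode_seconds_per_token(7, BANDWIDTH_GB_S)
t_large = decode_seconds_per_token(11, BANDWIDTH_GB_S)
saving = 1 - t_small / t_large  # fraction of memory-bound time saved

print(f"7B: {t_small:.2f} s/token, 11B: {t_large:.2f} s/token, saving: {saving:.0%}")
```

With 7B against 11B the weight traffic drops by 4/11, or about 36%, which is in the same ballpark as the 30% quoted above. Whether that offsets the extra attention compute depends on how memory-bound the device actually is.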

I wonder if the specific setup might be extra effective for coding-tuned models as well - you get one coding transformer and one ‘bad habits/chat/other non-coding stuff’ negative transformer.

In Big-O notation, O(2n) = O(n). Two times slower is actually not that much. If this slowdown results in better inference in the same number of training rounds or better-tuned weights with fewer redundant features, that can be a very worthwhile sacrifice.

It's also a complex optimization problem, not just a matter of compute. Two times the parameters takes more than two times the time to tune, and two times the working memory to train and use. There are also plenty of model training scenarios where data throughput from the dataset into memory and back out is the final bottleneck.

So, though I agree it is indeed a downside, I think it's a worthwhile sacrifice if the results they show are reproducible.

Glad to see your ideas here. Could you clarify a point for me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model has a W matrix twice the size of that in a standard attention model, which is d_model x d? E.g., suppose llama3 has a W of 8192 x 1024; does the diffattn model of the same architecture have a W of 8192 x (1024 x 2)?
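
One way to reason about the shape question is to count entries. The sketch below rests on an assumption that is not stated anywhere in this thread: that the diff model halves its number of heads when each head's projection widens from d_model x d to d_model x 2d. That is one reading of the paper's setup and should be checked against the paper itself. The sizes and function names are illustrative and not taken from any real model.

```python
def standard_qk_params(d_model, head_dim):
    """Total W^Q (or W^K) entries: h heads, each d_model x head_dim."""
    num_heads = d_model // head_dim
    return num_heads * d_model * head_dim


def diff_qk_params(d_model, head_dim):
    """Total W^Q (or W^K) entries: each head is d_model x (2 * head_dim).

    ASSUMPTION: the head count is halved to d_model / (2 * head_dim).
    """
    num_heads = d_model // (2 * head_dim)
    return num_heads * d_model * (2 * head_dim)


d_model, head_dim = 4096, 128  # illustrative sizes only

print(standard_qk_params(d_model, head_dim))
print(diff_qk_params(d_model, head_dim))
```

Under that assumption the per-head matrix doubles in width but the total projection size stays the same. If the head count were kept unchanged instead, the total would double, as the question supposes.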
The big-O for any transformer is always quadratic.
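
Both points made in this thread can hold at once: the cost of the attention scores grows quadratically with sequence length, and a second softmax map multiplies that cost by a constant. A rough count follows, where the function name and the head dimension of 128 are illustrative and only the QK^T products are counted.

```python
def score_flops(seq_len, head_dim, num_maps=1):
    """Approximate FLOPs for the QK^T score matrices of one head.

    Each map is a (seq_len x head_dim) @ (head_dim x seq_len) product,
    counted as 2 * seq_len^2 * head_dim (one multiply and one add per term).
    """
    return num_maps * 2 * seq_len ** 2 * head_dim


for n in (1024, 2048, 4096):
    one_map = score_flops(n, 128, num_maps=1)
    two_maps = score_flops(n, 128, num_maps=2)
    # The ratio is the constant factor; the growth across n is quadratic.
    print(n, one_map, two_maps, two_maps / one_map)
```

Doubling the sequence length quadruples both columns, while the ratio between them stays at 2 for every length.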
The two other changes they mention have been widely adopted, and are included in at least some of the models they benchmark against. It seems they list them for completeness as changes to the original transformer architecture.
Nicely spotted! Then, I really look forward to seeing this method tested by others! Epic stuff.