by aDyslecticCrow
615 days ago
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adapted easily by others. Bravo! I'm a little concerned about the last sentence of the introduction to section "2 Differential Transformer". It mentions using improvements from previous papers, but from the grammatical context it's unclear whether these improvements are applied to both the normal transformer and their diff transformer. Otherwise, it would sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me. Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful with published research in this field.
In any event, I'd imagine that this will get widely adopted if the numbers hold up. Like I said, there seems to be basically no downside, and it should be easy to replicate.