by adamsantoro
807 days ago
We spent a fair bit of effort ensuring we were accurate with the language and claims, so we're happy to take any feedback and make updates in subsequent versions. However, I don't see where we claim that MoD allocates more to some sequences and not others (specifically, the abstract says "transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence"). That said, it's a pretty simple change to make the approach work in the way you describe (allocating more to some sequences and not others): change the group across which the top-k works. In the paper we take the top-k over the time (sequence) dimension, so every sequence gets the same number of selected positions. One could instead take it over the combined batch * time dimension, which would result in asymmetric allocation across sequences.
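To make the distinction concrete, here is a rough sketch in plain Python. It is not the paper's implementation: the function names, the nested-list layout of `scores` as [batch][time], and the toy numbers are all assumptions made for illustration.

```python
def topk_mask_per_sequence(scores, k):
    """Top-k over the time dimension: keep the k highest-scoring
    positions within each sequence separately (hypothetical helper)."""
    mask = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
        keep = set(order[:k])
        mask.append([t in keep for t in range(len(row))])
    return mask


def topk_mask_across_batch(scores, k_total):
    """Top-k over the flattened batch * time dimension: keep the k_total
    highest-scoring positions anywhere in the batch (hypothetical helper)."""
    flat = [(s, b, t) for b, row in enumerate(scores) for t, s in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    keep = {(b, t) for _, b, t in flat[:k_total]}
    return [[(b, t) in keep for t in range(len(row))]
            for b, row in enumerate(scores)]


# Toy router scores (assumed values): sequence 0 scores higher everywhere.
scores = [
    [0.9, 0.8, 0.7, 0.6],
    [0.4, 0.3, 0.2, 0.1],
]

per_seq = topk_mask_per_sequence(scores, k=2)
across = topk_mask_across_batch(scores, k_total=4)

# Number of positions selected in each sequence.
per_seq_counts = [sum(row) for row in per_seq]  # [2, 2]
across_counts = [sum(row) for row in across]    # [4, 0]
```

Both variants select four positions in total, so the overall budget is the same. The per-sequence version splits it evenly, while the flattened version gives all of it to the first sequence in this toy example.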