|
|
|
|
|
by f38zf5vdt
612 days ago
|
|
There is a downside, every attention layer has to effectively compute attention twice (run scaled_dot_product_attention). As scaled_dot_product_attention is usually one of the most expensive operations in training and inference of a model, it seems like networks using this may be substantially slow and perhaps should considered against larger networks with more attention layers. https://github.com/microsoft/unilm/blob/master/Diff-Transfor... |
|
I wonder if the specific setup might be extra effective for coding tuned models as well - you get one coding transformer and one ‘bad habits/chat/other non coding stuff’ negative transformer.