by microtonal
811 days ago
Nice finding, and it makes a lot of sense! It is somewhat related to classification heads that learn their own weighted combination of all transformer layer outputs. I only glanced at the paper, but they don't seem to apply a softmax to the α_i for normalization?
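For anyone unfamiliar with the layer-weighting idea, here is a minimal NumPy sketch of it, assuming one learnable scalar per layer that is softmax-normalized before mixing. The names `mix_layers` and `alpha_logits` are hypothetical, and this is not taken from the paper, which may well leave its α_i unnormalized.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mix_layers(layer_outputs, alpha_logits):
    """Weighted sum of per-layer hidden states.

    layer_outputs: array of shape (num_layers, seq_len, hidden)
    alpha_logits:  array of shape (num_layers,), one scalar per layer
                   (hypothetical learnable parameters)
    """
    weights = softmax(alpha_logits)  # non-negative, sums to 1
    # Contract the layer axis: (num_layers,) x (num_layers, seq_len, hidden)
    return np.tensordot(weights, layer_outputs, axes=1)

# Toy data standing in for transformer layer outputs.
rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 3, 8))
alpha = np.zeros(4)  # equal logits give uniform weights
mixed = mix_layers(layers, alpha)
```

With all logits equal, the mix reduces to a plain average over layers. Without the softmax, the weights are free to grow or change sign, so the scale of the mixed representation is no longer tied to that of the individual layers.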