by microtonal 811 days ago
Nice finding, and it makes a lot of sense! It is somewhat related to classification heads that learn their own weighted combination of all transformer layer outputs.

I only glanced at the paper, but they don't seem to softmax the ⍺_i for normalization?
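
To make the question concrete, here is a minimal sketch of the kind of layer mixing I have in mind, with and without softmax over the weights. This is not taken from the paper: `mix_layers`, `raw_alphas`, and the toy vectors are hypothetical names and values chosen for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_outputs, alphas, normalize=True):
    """Weighted sum of per-layer vectors.

    layer_outputs: list of L vectors (lists of floats, all the same length)
    alphas: list of L raw scalar weights (hypothetical learned parameters)
    normalize: if True, pass alphas through softmax so the weights are
               positive and sum to 1; if False, use them as-is.
    """
    weights = softmax(alphas) if normalize else list(alphas)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(dim)
    ]

# Toy example: three "layers", each producing a 2-dimensional vector.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_alphas = [0.5, -1.0, 2.0]

mixed_norm = mix_layers(layers, raw_alphas)                  # convex combination
mixed_raw = mix_layers(layers, raw_alphas, normalize=False)  # unconstrained mix
```

With softmax the result stays inside the convex hull of the layer outputs. Without it, the weights can be negative or grow arbitrarily large, so the scale of the mixed representation is unconstrained.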