

by dartos 660 days ago

I mean, wouldn’t it be unlikely that SoftmaxA[n] - SoftmaxB[n] is exactly 0?

Even if the two attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.
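A quick way to sanity-check that intuition is a rough sketch like the one below. It assumes the two softmaxes come from independently drawn random logits (a stand-in for two separately learned attention maps, not anything from the actual model), and the names `softmax` and `count_exact_zeros` are made up for illustration. It just counts how often an element-wise difference comes out as exactly 0.0 in floating point.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def count_exact_zeros(n=8, trials=1000, seed=0):
    """Count entries where SoftmaxA[i] - SoftmaxB[i] is exactly 0.0.

    Assumption: both sets of logits are independent Gaussian draws,
    standing in for two attention maps that learned different things.
    """
    rng = random.Random(seed)
    zeros = 0
    for _ in range(trials):
        a = softmax([rng.gauss(0, 1) for _ in range(n)])
        b = softmax([rng.gauss(0, 1) for _ in range(n)])
        zeros += sum(1 for x, y in zip(a, b) if x - y == 0.0)
    return zeros


if __name__ == "__main__":
    print("exact zeros:", count_exact_zeros(), "out of", 8 * 1000)
```

The only case where the difference is guaranteed to be exactly zero here is when both softmaxes are computed from identical logits, which is the "weights cancel exactly" scenario the comment considers unlikely.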