Is SoftmaxA[n] - SoftmaxB[n] ever exactly 0?
Even if the two attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.
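To make that intuition concrete, here is a rough sketch, not anyone's actual model. It assumes the two softmaxes are single attention rows computed from independently initialized queries and keys; the sizes, the random seed, and the helper names (`softmax`, `attention_row`) are all made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys):
    """Attention weights of one query over all keys (scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


rng = random.Random(0)   # fixed seed so the run is repeatable
dim, n_keys = 8, 6       # hypothetical sizes

# Two independently "learned" sets of queries and keys (random stand-ins here).
query_a = [rng.gauss(0, 1) for _ in range(dim)]
query_b = [rng.gauss(0, 1) for _ in range(dim)]
keys_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
keys_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]

softmax_a = attention_row(query_a, keys_a)
softmax_b = attention_row(query_b, keys_b)

# Element-wise difference SoftmaxA[n] - SoftmaxB[n].
diff = [a - b for a, b in zip(softmax_a, softmax_b)]

# Feeding the same inputs to both sides is the case where every entry cancels.
same = attention_row(query_a, keys_a)
diff_identical = [a - b for a, b in zip(softmax_a, same)]

print("difference:", diff)
print("sum of differences:", sum(diff))
print("identical inputs:", diff_identical)
```

One thing the sketch does show: since each softmax row sums to 1, the differences sum to roughly 0, so some entries come out positive and some negative. Entries landing on exactly 0 would need the two rows to agree at that position, which independently learned weights are very unlikely to do.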