> softmax should be exp()/1+∑exp()
You referring to Miller's blogpost?[0] There's not an error in attention. Adding the +1 actually makes it not attention because you no longer generate a probability distribution[1]. There's nothing really preventing attention to have a zero in any of the entries, the thing is that you probably won't get -inf (very large negative number) inside inner product and you're going to have a difficult time updating those weights via gradient descent.I've also tested it on many networks and different types of attention and I've yet to see a meaningful improvement (or even an improvement), even in generalization. It really is the training method... As to the paper, I'm also still at a big lost and honestly, if reviewing could not accept it. The results look good, but I can't tell why and there's some "black magic" going on here. - Figure 3 has "Transformer" and doesn't specify. Is this StableLM-3B-4E1T?
- What fucking dataset is this on? Stable has a WandB link[2] for that project and I don't see any experiment with similar (presumably entropy?) loss values (come on... this is fucking research... label your fucking graphs...)
- Where the fuck is the ablation? (Yes, I saw Fig 6 and Sec 3.8)
- How do I know that (assuming this is Stable) that the difference isn't just hyperparemeters? Or worse, GPUs! (yes, number of GPUs can change results due to sharding and this changing the statistics)
- How do I know it isn't down to 1k warmup steps instead of 5k?
- What about hidden size, layers, heads, or FFN size? Stable has 32/2560/32/? and this has 28/3072/12/8192 (these all will mess with sharding statistics too). Is the head dimension the same?
- How do I know it isn't down to the tokenizer?
- What is this magic? `0.8 - 0.6 * math.exp(-0.3 * depth)`
- Was this learned? Hand picked? This is a huge factor
- Any information about the learned parameters? Their final values? Trajectories?
- The code does not seem to be the same as whats in the algos...
Obviously they improved something, but there is nothing in the paper that is convincing me that it is the differential attention. There are too many parameters at play and how am I supposed to know that the difference is by the thing they are proposing. And more importantly, how much it is improved by that specific thing and not by other things. [0] https://www.evanmiller.org/attention-is-off-by-one.html
[1] This is a bit convoluted but without this condition many "alternative forms" you see would be equivalent to other architectures like linear layers or gated units. Term is not well defined, but this really appears to be the only agreed upon aspect, even if only implicitly stated. This is a much longer conversation though.
[2] https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo
[2.1] The config: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-3b-4e1t.yml
|