It probably means they have tried it for _some_ purpose, but not necessarily the one described in OP's post here. The claim is that this is specifically useful for quantization. It seems reasonable to assume that this would have initially been tried and potentially discarded for having little or no impact on general accuracy. But that's a different issue. I suppose we'll hear something definitive in a month or so.
If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention), most values are going to be close to 0. This means each of them contributes about e^0 = 1 to the softmax denominator. Now, if you have a context length of say 2000, your denominator is already ~ 2000. Increasing it to 2001 doesn't really make a difference.
Adding 1 to the denominator can be useful if you have softmax with just a few options. Not in self-attention where you have thousands.
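To put rough numbers on that, here is a minimal sketch in plain Python. The names `softmax` and `softmax_plus_one` are made up for illustration, and the scores are just Gaussian noise around 0 (an assumption standing in for "more or less random" query/key inner products), not the output of any real model.

```python
import math
import random

def softmax(scores):
    # Standard softmax, shifted by the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(scores):
    # exp(s_i) / (1 + sum_j exp(s_j)), using the same stabilising shift.
    # The extra "1" in the denominator becomes exp(-m) after shifting.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

def missing_mass(n_scores, seed=0):
    # Probability mass the +1 term takes away from the n_scores options,
    # for scores drawn close to 0 (assumed: Gaussian, std 0.1).
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 0.1) for _ in range(n_scores)]
    return 1.0 - sum(softmax_plus_one(scores))

if __name__ == "__main__":
    print("context of 2000:", missing_mass(2000))
    print("only 4 options: ", missing_mass(4))
```

With 2000 near-zero scores the +1 term takes a tiny fraction of the total mass, while with 4 options it takes a noticeable chunk, which is the distinction being made above.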
That simple comment is a strong counterpoint to the entire blog post?
Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so that softmax outputs values close to zero, whereas it wouldn't bother before because making one prob bigger makes another smaller.
> it might be that the model trains all of the inputs to become very negative
It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so with 2000 vectors the denominator is still at least 2000 * e^(-1) ≈ 736.
Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.
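The arithmetic behind that bound can be checked in a few lines. This is only a sketch of the worst case under the assumption stated above (all query and key vectors have norm at most 1); `min_softmax_denominator` is a hypothetical helper, not something from a library.

```python
import math

def min_softmax_denominator(n_keys, max_norm=1.0):
    # By Cauchy-Schwarz, q.k >= -|q||k| >= -max_norm**2, so every one of the
    # n_keys terms exp(q.k) is at least exp(-max_norm**2).
    return n_keys * math.exp(-max_norm ** 2)

bound = min_softmax_denominator(2000)
# Largest share of the total that the extra +1 can take in this worst case.
max_share_of_plus_one = 1.0 / (1.0 + bound)

if __name__ == "__main__":
    print("lower bound on denominator:", bound)
    print("max share taken by the +1: ", max_share_of_plus_one)
```

Even in that worst case the +1 accounts for well under 1% of the denominator.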
Are dummy tokens just tokens that don't have an associated input/output token? Like, a way to give more computational power to the model without splitting the text into more actual tokens?
TL;DR sort of yes. But they're also useful for reasons not related to computational "power".
Here's an example with an actual algorithm, although it's been a couple of years so my explanation might be a bit wrong in places, and/or I might have gotten completely the wrong end of the stick with the current thread.
--
The CTC (Connectionist Temporal Classification [0]) algorithm maps a sequence x with length X -> sequence y with length Y.
i.e. in speech to text we might have some audio features that correspond to the following class predictions (post softmax classification)
x -> hellllloooooooooo wwwooorrrllld
we want to get this as the output
y -> hello world
we have the alphabet as classes we try to predict for each sequence item in x.
we could just remove all the duplicates in the first long sequence, but we would end up with `helo world` ... we need to preserve one of the early `l` characters in `hello` somehow
CTC uses a blank (aka dummy) token to handle potentially deliberately repeated items in sequence x.
By adding the blank token to the class predictions, we can get the model to predict something like this (post softmax classification)
y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d
The CTC decoder (a non-ML decoding algo) collapses runs of repeated tokens and then drops the blank tokens, turning the above into ...
y -> hello world
... the repeated `o` characters are collapsed into one and the `~` characters are removed.
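A minimal Python sketch of that collapse step, run on the two strings from this comment. It assumes the per-frame class predictions have already been turned into characters (argmax after softmax) and uses `~` as the blank, as written above. The function names are made up for illustration, and this covers only the greedy collapse, not the CTC loss or a beam-search decoder.

```python
from itertools import groupby

BLANK = "~"  # blank/dummy symbol, as written in the example above

def collapse_repeats(frames):
    # Keep one symbol per run of identical adjacent symbols.
    return "".join(symbol for symbol, _run in groupby(frames))

def ctc_greedy_collapse(frames, blank=BLANK):
    # Collapse repeats first, then drop blanks. The order matters: a blank
    # sitting between two identical symbols keeps them as separate runs.
    return collapse_repeats(frames).replace(blank, "")

if __name__ == "__main__":
    # Without a blank token, the double "l" in "hello" is lost.
    print(collapse_repeats("hellllloooooooooo wwwooorrrllld"))
    # With blanks separating the two "l" runs, it survives.
    print(ctc_greedy_collapse("hel~l~~oooo~~~~~~ w~~o~~r~~l~~d"))
```

The first call gives `helo world` and the second gives `hello world`, because the `~` between the two `l` runs stops them from being merged.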
It was a decent enough algorithm for speech-to-text prior to attention/transformers etc.
However, it makes CTC vulnerable to well-designed adversarial example attacks because there is a massive bias within models to predict the blank token -- meaning it's very easy to modify input sequence x to switch the output sequence y to include blank tokens for nefarious purposes (the subject of my unfinished PhD).
> By adding the blank token to the class predictions, we can get the model to predict something like this (post softmax classification)
> y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d
This is a great solution. Though that's a dummy token in the output rather than the input. I guess you could do something inverse to do text to speech, but it might be hard to say where to insert the dummy tokens in that case.