by drdeca
644 days ago
Isn’t the gradient descent that's used stochastic gradient descent? I think that could matter a little bit. Also, when the base model is responding to base64 text, most of the time the next token is also part of the base64 text, right? So presumably the first thing to learn would be predicting how some base64 text continues, which, when the base64 text is an encoding of some ASCII text, seems like it would involve picking up on the patterns in that text? I would think there would be both those cases and cases where the plaintext is present before or after.
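
To make "picking up on the patterns" a bit more concrete, here is a rough sketch of one such pattern (my own illustration, not something measured on any model). Every ASCII byte has its top bit clear, so the first character of each 4-character base64 group can only come from the first 32 symbols of the alphabet. The helper name `first_char_indices` and the sample string are made up for the example.

```python
import base64

# Standard base64 alphabet (RFC 4648), in index order 0..63.
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def first_char_indices(text: str) -> list[int]:
    """Alphabet index of the first character of every 4-character group."""
    encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
    return [B64_ALPHABET.index(encoded[i]) for i in range(0, len(encoded), 4)]


sample = "the next token is also part of the base64 text"
indices = first_char_indices(sample)

# The first 6 bits of a group are the top 6 bits of an ASCII byte (< 128),
# so the index is always below 32, i.e. one of 'A'..'Z' or 'a'..'f'.
print(all(i < 32 for i in indices))
```

So even without any plaintext nearby, base64 of ASCII text is far from uniform noise, and there is structure for next-token prediction to latch onto.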
However, Transformer self-attention is based on key-based lookup rather than adjacency, although embeddings do include positional encoding, so it can also use position where useful.
At the end of the day, though, this is one of the easiest types of prediction for a transformer/LLM to learn: notwithstanding that we're dealing with blocks, we've just got base64 directly followed by the corresponding plain text, so it's a direct 1:1 correspondence of "when you see X, predict Y", as opposed to most other language use, where what follows what is far harder to predict.
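
A small sketch of what "dealing with blocks" means here, assuming standard base64 as in Python's `base64` module: each 3-byte chunk of plain text maps to its own 4-character chunk of output, independent of what comes after it. The helper names `encode` and `blocks_align` are just for illustration.

```python
import base64


def encode(text: str) -> str:
    """Standard base64 encoding of an ASCII string."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")


def blocks_align(text: str) -> bool:
    """Check that every 3-byte prefix encodes to the matching 4-char prefix."""
    full = encode(text)
    for k in range(len(text) // 3 + 1):
        # The first 3*k plain-text bytes fully determine the first 4*k
        # base64 characters, regardless of the rest of the text.
        if encode(text[: 3 * k]) != full[: 4 * k]:
            return False
    return True


print(encode("Man"))  # TWFu
print(blocks_align("when you see X, predict Y"))  # True
```

That is the sense in which the mapping is local: the model never needs more than one small block of context on each side to get the correspondence right.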