
Yes, most examples in the training set presumably consist of a block of B64-encoded text followed by the corresponding block of plain text.

However, Transformer self-attention works by query/key lookup rather than adjacency, although the embeddings do include positional encoding, so it can also use position where useful.
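To make the "lookup rather than adjacency" point concrete, here's a rough sketch of single-query dot-product attention in plain Python. The function name `attend` and the toy vectors are made up for illustration, and it deliberately leaves out positional encoding, learned projections and multiple heads, so it's not how any real model is wired up.

```python
import math

def attend(query, keys, values):
    """Single-query scaled dot-product attention: a soft lookup over keys."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, regardless of where the key sits.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted average of the values.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy data (hypothetical): the query matches the second key.
keys = [[8.0, 0.0], [0.0, 8.0], [-8.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
query = [0.0, 8.0]

weights, output = attend(query, keys, values)

# Reordering the key/value pairs does not change what gets retrieved.
_, output_reversed = attend(query, keys[::-1], values[::-1])
```

Without positional information the result depends only on which key matches the query, not on how far away it is, which is why the plain text doesn't need to be adjacent to the B64 it corresponds to.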

At the end of the day though, this is one of the easiest types of prediction for a transformer/LLM to learn. Notwithstanding that we're dealing with blocks, we've just got B64 directly followed by the corresponding plain text, so it's a direct 1:1 correspondence of "when you see X, predict Y" (each group of 4 B64 characters always decodes to the same 3 bytes), as opposed to most other language use, where what follows what is far harder to predict.
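A quick way to see the block-level correspondence is to decode Base64 one 4-character group at a time using Python's standard `base64` module. The helper name `decode_blockwise` is just something I made up for this sketch; it assumes standard padded Base64 with no line breaks.

```python
import base64

def decode_blockwise(b64_text: str) -> bytes:
    """Decode padded Base64 one 4-character group at a time."""
    out = b""
    for i in range(0, len(b64_text), 4):
        block = b64_text[i:i + 4]
        # Each group decodes on its own, without looking at its neighbours.
        out += base64.b64decode(block)
    return out

plain = b"when you see X, predict Y"
encoded = base64.b64encode(plain).decode("ascii")

# Decoding group by group gives the same bytes as decoding all at once.
assert decode_blockwise(encoded) == plain
assert decode_blockwise(encoded) == base64.b64decode(encoded)
```

Since every group decodes independently, the mapping is a fixed local rule, which is about as predictable as next-token prediction gets.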