by HarHarVeryFunny 644 days ago
Base64 encoding is very simple - it just takes each 6 bits of the input and encodes (replaces) them as one of the 64 (2^6) characters A-Za-z0-9+/. If the input is 8-bit ASCII text, then every 3 input characters will be encoded as 4 Base64 characters (3 * 8 = 24 bits = 4 * 6-bit Base64 chunks).
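A minimal sketch of that 3-bytes-to-4-characters mapping, for illustration only. The function name `b64_encode` and the `ALPHABET` constant are made up for this example; in practice you would use the standard library's `base64.b64encode`.

```python
# Standard Base64 alphabet: index 0..63 maps to one character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(data: bytes) -> str:
    """Hand-rolled Base64 encoder (illustrative, not optimized)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pad a short final chunk with zero bytes so we always have 24 bits.
        padded = chunk + b"\x00" * (3 - len(chunk))
        n = int.from_bytes(padded, "big")
        # Split the 24 bits into four 6-bit indices.
        indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[j] for j in indices]
        # Only len(chunk) + 1 characters carry real data; the rest become '='.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)


print(b64_encode(b"the cat sat on the mat"))
```

The printed string should match what `base64.b64encode` returns for the same bytes.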

So, this is very similar to an LLM having to deal with tokenized input, but instead of sequences of tokens representing words you've got sequences of Base64 characters representing words.


It's not about how simple B64 is or isn't. In fact, I chose a simple problem we've already solved algorithmically on purpose. It's that all you've just said, reasonable as it may sound, is entirely speculation.

Maybe "no idea" was a bit much for this example but any idea certainly didn't come from seeing the matrices themselves fly.

That's not entirely true in the case of base64 because of how statistical patterns within natural languages work. For example, you can use frequency analysis to decrypt a monoalphabetic substitution cipher on pretty much any language if you have a frequency table for character n-grams of the language, even with small values of n. This is much shallower statistical processing than what's going on within an LLM, so I don't think many were surprised that a transformer stack and attention heads could decode base64, especially if there were also examples of base64 encoding in the training data (even without parallel corpora for their encodings).
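A small sketch of the property frequency analysis relies on: a monoalphabetic substitution changes which symbol is which, but not how often each symbol occurs. The key, the helper names `encrypt` and `profile`, and the sample sentence are all assumptions made for this example; a real attack would compare against n-gram tables for the language.

```python
from collections import Counter
import string

# Hypothetical substitution key: a fixed permutation of the lowercase alphabet.
KEY = "qwertyuiopasdfghjklzxcvbnm"
TABLE = str.maketrans(string.ascii_lowercase, KEY)


def encrypt(text: str) -> str:
    """Apply the monoalphabetic substitution; non-letters pass through."""
    return text.translate(TABLE)


def profile(text: str) -> list[int]:
    """Letter counts sorted high to low, ignoring which letter is which."""
    return sorted(Counter(c for c in text if c.isalpha()).values(), reverse=True)


plain = "the cat sat on the mat"
cipher = encrypt(plain)

# The shape of the frequency distribution survives the substitution.
print(profile(plain) == profile(cipher))

# The most common ciphertext letter stands in for the most common plaintext letter.
top_plain = Counter(c for c in plain if c.isalpha()).most_common(1)[0][0]
top_cipher = Counter(c for c in cipher if c.isalpha()).most_common(1)[0][0]
print(top_plain, "->", top_cipher)
```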

It doesn't explain higher level generalizations like being a transpiler between different programming languages that didn't have any side-by-side examples in the training data. Or giving an answer in the voice of some celebrity. Or being able to find entire rhyming word sequences across languages. These are probably more like the kind of unexplainable generalizations that you were referring to.

I think it may be better to frame it in terms of accuracy vs precision. Many people can explain accurately what an LLM is doing under all those matrix multiplies, both during training and inference. But, precisely why an input leads to the resulting output is not explainable. Being able to do that would involve "seeing" the shape of the hypersurface of the entire language model, which as sibling commenters have mentioned is quite difficult even when aided by probing tools.

Huh? I just pointed out what Base64 encoding actually is - not some complex algorithm, but effectively just a tokenization scheme.

This isn't speculation - I've implemented Base64 decode/encode myself, and you can google for the definition if you don't believe I've accurately described it!

The speculation here is not about what b64 text is. It's about how the LLM has learnt to process it.

Edit: Basically, for all anyone knows, it treats b64 as another language entirely, and decoding it is akin, in the network, to translating French rather than the very simple swapping you've just described.

LLMs, just like all modern neural nets, are trained via gradient descent which means following the most direct path (steepest gradient on the error surface) to reduce the error, with no more changes to weights once the error gradient is zero.

Complexity builds upon simplicity, and the LLM will begin by noticing the direct (and repeated without variation) predictive relationship between Base64 encoded text and corresponding plain text in the training set. Having learnt this simple way to predict Base64 decoding/encoding, there is simply no mechanism whereby it could change to a more complex "like translating French" way of doing it. Once the training process has discovered that Base64 text decoding can be PERFECTLY predicted by a simple mapping, then the training error will be zero and no more changes (unnecessary complexification) will take place.
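A toy illustration of the "no more changes once the gradient is zero" point: plain full-batch gradient descent on a one-parameter loss (w - 3)^2. The loss, the function names, and the learning rate are assumptions chosen for this sketch and say nothing about how an actual LLM is trained.

```python
def grad_loss(w: float) -> float:
    """Gradient of the hypothetical loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(grad, w: float, lr: float = 0.1, steps: int = 200) -> float:
    """Repeatedly step against the gradient; stop early if it is exactly zero."""
    for _ in range(steps):
        g = grad(w)
        if g == 0.0:
            break  # zero gradient means the update would be zero anyway
        w -= lr * g
    return w


print(gradient_descent(grad_loss, 0.0))  # moves toward the minimum at w = 3
print(gradient_descent(grad_loss, 3.0))  # already at the minimum, so w is unchanged
```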

Isn’t the gradient descent that's used actually stochastic gradient descent? I think that could matter a little bit.

Also, when the base model is responding to base64 text, most of the time the next token is also part of the base64 text, right? So presumably the first thing to learn would be predicting how some base64 text continues, which, when the base64 text is an encoding of some ASCII text, seems like it would involve picking up on the patterns for that?

I would think that there would be both those cases, and cases where the plaintext is present before or after.

Yes, most examples in the training set presumably consist of a block of B64 encoded text followed by the corresponding block of plain text.

However, Transformer self-attention is based on key-based lookup rather than adjacency, although embeddings do include positional encoding so it can also use position where useful.

At the end of the day though, this is one of the easiest types of prediction for a transformer/LLM to learn, since (notwithstanding that we're dealing with blocks), we've just got B64 directly followed by the corresponding plain text, so it's a direct 1:1 correspondence of "when you see X, predict Y", as opposed to most other language use where what follows what is far harder to predict.

Modern neural networks are by no means guaranteed to converge on the simplest solution, and examples abound in which NNs are discovered to learn weird esoteric algorithms when simpler ones exist. The reason why is kind of obvious. The simplest solution (that you're alluding to) from the perspective of training is simply what works best first.

It's no secret that the order of data has an impact on what the network learns and how quickly; it's just not feasible to police for these giant trillion-token datasets.

If an NN learns a more complex solution that works perfectly for a less complex subset it meets later on, there is little pressure to find the simpler solution, especially when we're talking about instances where the more complex solution might be more robust to any weird permutations it might meet on the internet. E.g., there is probably a simpler way to translate text that never has typos, and an LLM will never converge on it.

Decoding/encoding b64 is not the first thing it will learn. It will learn to predict it first, as it predicts any other language-carrying sequence. Then it will learn to translate it, most likely long after learning how to translate other languages. All that will have some impact on the exact process it carries out with b64.

And like I said, we already know for a fact it's not just doing naive substitution, because it can recover corrupted b64 text wholesale that our substitutions cannot.

> examples abound in which NNs are discovered to learn weird esoteric algorithms when simpler ones exist

What examples do you have in mind?

Normally it's the opposite, where one hopes for the neural net to learn something complex, and it picks up on a far simpler pattern and uses that instead (e.g. all your enemy tanks are on a desert background, vs the others on a grass background, so it learns to discriminate based on sand vs grass).

You're anthropomorphizing by saying that corrupted b64 text can be recovered. There is no "recovery process", but rather conflicting prediction patterns: the b64 encoding predicting the corresponding plain text, and the plain text predicting its own continuation.

e.g.

"the cat sat on the mat" encodes as dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==, but say we've instead got a corrupted dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== that decodes to "the cat sat on the xxt", so if you ask ChatGPT to decode this, it might start generating as:

dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== decodes to "the cat sat on the" ...

At this point the LLM has two conflicting predictions - the b64 encoding predicting "xxt", and the plain text that it has generated so far predicting "mat". Which of these will prevail is going to depend on the specifics. I haven't tried it, but presumably this "recovery" only works where the encoded text is itself predictable ... it won't happen if you encode a random string of characters.
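The two strings in the example above can be checked with Python's standard `base64` module. This is only a sanity check of the encodings quoted in the comment (variable names are made up for the sketch); it says nothing about what an LLM does internally.

```python
import base64

clean = "dGhlIGNhdCBzYXQgb24gdGhlIG1hdA=="
corrupted = "dGhlIGNhdCBzYXQgb24gdGhlIHh4dA=="

# A plain table-driven decoder has no notion of "likely" text:
# it returns exactly what the characters encode.
decoded_clean = base64.b64decode(clean).decode("ascii")
decoded_corrupted = base64.b64decode(corrupted).decode("ascii")
print(decoded_clean)
print(decoded_corrupted)

# Positions where the two Base64 strings differ (0-indexed).
diff = [i for i, (a, b) in enumerate(zip(clean, corrupted)) if a != b]
print(diff)
```

Only three Base64 characters differ, all inside the 4-character group that carries " ma" versus " xx", which is why the rest of the sentence decodes identically.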