Hacker News
by waitforit 1090 days ago
Just because it's confident doesn't mean it's right. But it probably contains training data from this or similar sites.

https://enigma.hoerenberg.com/index.php?cat=The%20U534%20mes...

> Interpretation (preliminary):

> [An] U-4701, nachrichtlich [an] U-Stützpunkt Lübeck von Chef 4. U-Flottille: Mit U-4702 und U-4703 zur Flender Werft Lübeck gehen. Von dort folgt Weiteres.

> Translation (preliminary):

> [To] U-4701, for information [to] Submarine Base Lübeck from Chief of 4th Submarine Flotilla: With U-4702 and U-4703 go to Flender Dockyard at Lübeck. From there more follows.

2 comments

There are about 2^67 different initial settings for an Enigma machine. The inverse probability that a real seven-letter German word (LUEBECK) appears twice in a random string of similar length to this message is also pretty close to 2^67. So if you decrypted one ciphertext with every incorrect setting, you might expect to see about one purported plaintext that is wrong but still contains LUEBECK twice, or a similarly misleading occurrence. Since there is also exactly one correct plaintext, seeing LUEBECK twice already puts you at roughly 50/50 that it's the real message rather than the most convincing wrong plaintext (if you had no prior knowledge of what the settings might be). The additional presence of even a few other recognizable German words (or common abbreviations such as triple letters and the shortened names for the numbers) makes it overwhelmingly likely that this is the correct plaintext. LUEBECK + LUEBECK + STUETZPUNKT in one message makes the chance that it's not the real message on the order of winning a state lottery jackpot two weeks running, even if the rest of the message were gibberish. In practice, much shorter pieces of plaintext than the double LUEBECK (like a single triple U, one spelled-out number, or highly abbreviated weather info) were used to validate guessed settings with a high degree of confidence.
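A rough sketch of that arithmetic, in case anyone wants to check it. The assumptions here are mine, not from the message itself: the commonly cited configuration of 3 rotors chosen from 5 with 10 plugboard cables, wrong settings producing uniformly random letters, and the word landing at two fixed, non-overlapping positions. Letting the word appear anywhere in the message would raise the false-hit count, so treat this as an order-of-magnitude check only.

```python
import math

# Assumption: 3 rotors chosen from 5, 26 start positions per rotor,
# and 10 plugboard cables (the commonly cited configuration).
rotor_orders = 5 * 4 * 3                      # ordered choice of 3 rotors from 5
start_positions = 26 ** 3                     # one start letter per rotor
plugboard = math.factorial(26) // (
    math.factorial(6) * math.factorial(10) * 2 ** 10
)                                             # ways to pair 20 of 26 letters using 10 cables

settings = rotor_orders * start_positions * plugboard
log2_settings = math.log2(settings)

# Assumption: a wrong setting yields uniformly random letters, and the
# word must land at two fixed, non-overlapping positions.
word = "LUEBECK"
log2_inverse_probability = 2 * len(word) * math.log2(26)

# Expected number of wrong settings that still show the word twice there.
expected_false_hits = 2 ** (log2_settings - log2_inverse_probability)

print(f"settings            ~ 2^{log2_settings:.1f}")
print(f"inverse probability ~ 2^{log2_inverse_probability:.1f}")
print(f"expected false hits ~ {expected_false_hits:.1f}")
```

Under these assumptions the two exponents land within a couple of bits of each other, which is where the "a handful of convincing wrong plaintexts versus one right one" intuition comes from.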
That would be true if it actually did the decryption, but my point was that an LLM doesn't decrypt. It just has the encrypted string followed by the decrypted string in its training data, so it outputs something that's almost correct. (The numbers are wrong: 4501, 4502, 4503 instead of 4701, 4702, 4703. Maybe some bugged training data, maybe hallucination.)
Sure, and I said it was confident, not correct.

I find it interesting that it pulled some kind of interpretation from the string, far more than I would have. I asked it to translate the English data back into a similar plaintext string, then asked a second instance to decode it, and it came back with a similar, slightly distorted response.

My point is more that a language model is exactly the sort of thing you would use to judge whether a candidate plaintext is a genuine decryption, and given the various anachronisms and shorthands, our own personal language models may not be adequate.

But a giant one that's been fed all sorts of data, including examples of text with similar usage, actually sounds like it might be exactly the tool for this problem.
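A minimal sketch of the idea of scoring candidate plaintexts with a language model. To be clear, this is a tiny character-bigram model, far cruder than the LLM being discussed, and every name in it (`normalize`, `train_bigrams`, `score`, `CORPUS`) is made up for illustration. The toy corpus is just the interpreted message quoted above; a usable scorer would need a large corpus of period naval traffic with its abbreviations and spelled-out numbers.

```python
import math
from collections import Counter

ALPHABET_SIZE = 26


def normalize(text):
    """Uppercase, spell umlauts as digraphs (Ü -> UE), keep only A-Z."""
    text = text.upper()  # str.upper() already turns ß into SS
    for umlaut, digraph in (("Ä", "AE"), ("Ö", "OE"), ("Ü", "UE")):
        text = text.replace(umlaut, digraph)
    return "".join(ch for ch in text if "A" <= ch <= "Z")


def train_bigrams(corpus):
    """Count adjacent letter pairs in the normalized corpus."""
    letters = normalize(corpus)
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))


def score(candidate, counts):
    """Mean log2 probability per bigram, with add-one smoothing."""
    letters = normalize(candidate)
    if len(letters) < 2:
        raise ValueError("need at least two letters to score")
    total = sum(counts.values()) + ALPHABET_SIZE ** 2
    bigrams = [letters[i:i + 2] for i in range(len(letters) - 1)]
    return sum(math.log2((counts[b] + 1) / total) for b in bigrams) / len(bigrams)


# Hypothetical toy corpus: the interpreted message quoted in this thread.
CORPUS = (
    "An U-4701, nachrichtlich an U-Stützpunkt Lübeck von Chef 4. U-Flottille: "
    "Mit U-4702 und U-4703 zur Flender Werft Lübeck gehen. Von dort folgt Weiteres."
)
counts = train_bigrams(CORPUS)

plausible = score("STUETZPUNKT LUEBECK", counts)
gibberish = score("QXZJVKW QJXZVWK", counts)
print(plausible, gibberish)
```

A higher (less negative) score means the candidate looks more like the training text. Because the corpus here is the message itself, this only demonstrates the mechanics and says nothing about how well such a scorer would separate real decryptions from near misses.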