Hacker News
by codetrotter 1215 days ago
A few days ago I asked ChatGPT if “pannekake” and “kannepake” are anagrams of each other.

It correctly stated that they are, but when it went on to prove it, it generated a table of the frequencies of the individual letters in these two words, and the table looked like this:

    Letter | Frequency in | Frequency in
           | “pannekake”  | “kannepake”
    - - - - - - - - - - - - - - - - - - -
    a      | 2            | 2
    e      | 2            | 2
    k      | 2            | 2
    n      | 2            | 2
    p      | 2            | 2

This reminded me that, yes indeed, AI just isn’t quite there yet. It got the answer right, but then got the proof wrong: it hallucinated the frequency count of the letter “p”, which occurs only once, not twice, in each of those words.
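
For what it's worth, the counts are easy to check mechanically. Here is a minimal Python sketch of the same frequency-table argument; the function names `letter_counts` and `are_anagrams` are just names I picked for illustration.

```python
from collections import Counter


def letter_counts(word: str) -> Counter:
    """Count how many times each letter occurs in the word."""
    return Counter(word)


def are_anagrams(first: str, second: str) -> bool:
    """Two words are anagrams when their letter counts are identical."""
    return letter_counts(first) == letter_counts(second)


first, second = "pannekake", "kannepake"
counts_first = letter_counts(first)
counts_second = letter_counts(second)

# Print one row per letter, in the same layout as the table above.
for letter in sorted(counts_first | counts_second):
    print(f"{letter} | {counts_first[letter]} | {counts_second[letter]}")

print("anagrams:", are_anagrams(first, second))
```

This prints 2 for “a”, “e”, “k” and “n”, and 1 for “p”, in both columns, so the conclusion (they are anagrams) holds and only the “p” row of the generated table was wrong.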
1 comment

Anything that has to do with the individual letters within words doesn't work well, but as I understand it, this is an artifact of the tokenization process. E.g. “pannekake” is internally 4 tokens: pan-ne-k-ake. And I don't think the mapping from tokens to letter sequences is part of the training data, so the model has to infer it.
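
To make that concrete, here is a toy Python sketch. It is not a real tokenizer: the split pan-ne-k-ake is the one I gave above (I haven't verified it against an actual vocabulary), and the token IDs in `TOY_VOCAB` are made up purely for illustration.

```python
# Hypothetical vocabulary: the IDs are invented, and the split
# pan-ne-k-ake is assumed, not taken from a real tokenizer.
TOY_VOCAB = {"pan": 101, "ne": 102, "k": 103, "ake": 104}
PIECES = ["pan", "ne", "k", "ake"]


def toy_encode(pieces):
    """Map each token string to its (made-up) integer ID."""
    return [TOY_VOCAB[piece] for piece in pieces]


def toy_decode(ids):
    """Map IDs back to token strings and join them into text."""
    id_to_piece = {token_id: piece for piece, token_id in TOY_VOCAB.items()}
    return "".join(id_to_piece[token_id] for token_id in ids)


ids = toy_encode(PIECES)

# What the model receives: four opaque integers, no letters at all.
print(ids)

# Counting "p" needs the spelling of every token, which is only
# available after decoding the IDs back to text.
text = toy_decode(ids)
print(text, text.count("p"))
```

The point is that nothing in the ID sequence says that token 101 contains a “p” and token 103 does not; that knowledge lives in the vocabulary table, which the model never sees directly.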