HN Mirror


by onionisafruit 238 days ago

> It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all

This goes against my limited understanding of how LLMs work, and computers generally for that matter. Isn’t that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The point about similar-looking characters makes more sense to me, though, assuming it’s something along the lines of recognizing that “S” and “$” are roughly the same thing except for the line down the middle. Still, that seems like something that doesn’t come up much and is probably covered by observations made in the training corpus.
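To make the "series of bits" point concrete, here is a minimal Python sketch of two characters that render almost identically but are unrelated as far as the encoding is concerned. The choice of Latin "a" versus Cyrillic "а" is an illustrative assumption, not an example from the quoted post, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Two letters that most fonts draw identically (illustrative choice).
latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

def describe(ch):
    """Return the Unicode name and the UTF-8 bytes of a single character."""
    return (unicodedata.name(ch), ch.encode("utf-8"))

for ch in (latin_a, cyrillic_a):
    name, raw = describe(ch)
    print(f"{ch!r}: {name}, UTF-8 bytes = {raw.hex()}")

# The strings compare unequal even though they look the same on screen.
print(latin_a == cyrillic_a)
```

A model that consumes text sees only these differing byte or token IDs; any visual resemblance has to be inferred from how the characters are used in the training data.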

All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.


jncfhnb 238 days ago

You’re reading it backwards. He is not praising that behavior, he is complaining about it. He is saying that bots _should_ parse a smiling face emoji as a smiling face, but they currently don’t: as text, it gets passed in as gross Unicode that has a lot of ambiguity and just happens to ultimately get rendered as a face for end users.


ares623 238 days ago

Wouldn’t the training or whatever make that Unicode sequence effectively a smiley face?


jncfhnb 238 days ago

Yes, but the same face gets represented by many unique strings, which may or may not be tokenized into a single clean “smiley face” token.
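As a rough illustration of "many unique strings", the Python sketch below lists a few ways a smiling face can appear in text and the UTF-8 bytes each one produces. The particular variants and the `utf8_bytes` helper are assumptions chosen for illustration. How a real tokenizer splits or merges these bytes depends on its vocabulary, which this sketch does not model.

```python
# A few strings a reader would all call "a smiley" (illustrative selection).
variants = {
    "slightly smiling face": "\U0001F642",
    "white smiling face": "\u263A",
    "white smiling face + emoji variation selector": "\u263A\uFE0F",
    "ASCII emoticon": ":)",
}

def utf8_bytes(s):
    """Return the UTF-8 encoding of s as a list of integer byte values."""
    return list(s.encode("utf-8"))

for label, text in variants.items():
    codepoints = " ".join(f"U+{ord(c):04X}" for c in text)
    print(f"{label}: {codepoints} -> {utf8_bytes(text)}")
```

Each variant yields a different byte sequence, so a byte-level tokenizer starts from different inputs for what a human reads as the same face.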


scotty79 238 days ago

Don't ask ChatGPT about seahorse emoji.


astrange 238 days ago

That's caused by the sampler and chatbot UI not being part of the LLM. It doesn't get to see its own output before it's sent out.


tensor 238 days ago

Don't ask humans either, apparently.
