HN Mirror



	by cyberge99 40 days ago
	Forgive me if this is a naive assumption, but wouldn’t large language models be fundamentally different for a language that is largely symbols? Again, my understanding of Mandarin is limited if it exists at all.

2 comments

doph 40 days ago

All tokens are symbols. All of the frontier models speak Mandarin.


boothby 40 days ago

This is why misspellings and homophones are tells of human righting. LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.


omneity 40 days ago

Funny, I’ve been cracking[0] at this exact problem with a purpose-built model[1]:

0: https://huggingface.co/posts/omarkamali/593639295164067

1: https://omneitylabs.com/models/sawtone


jddj 40 days ago

Claude the other day wrote code where one of the bytes in the array was 0xO5.

That's zero ex oh (the letter) five
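To make the bug concrete: the letter O is not a hexadecimal digit, so any base-16 parser rejects that literal. A minimal Python sketch (the helper name `parses_as_hex` is made up for this example):

```python
def parses_as_hex(literal: str) -> bool:
    """Return True if the string is a valid base-16 integer literal."""
    try:
        int(literal, 16)  # base 16 accepts an optional "0x" prefix
        return True
    except ValueError:
        return False


print(parses_as_hex("0x05"))  # second character after the prefix is the digit zero
print(parses_as_hex("0xO5"))  # here it is the capital letter O
```

The two strings look almost identical in many fonts, but only the first one parses.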


mejutoco 40 days ago

> righting.

> LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.

Is this an elaborate joke, or is your full-word misspelling of "writing" both agreeing with your statement (word substitutions) and contradicting it (the similarity is in pronunciation only, not semantics)?


calfuris 40 days ago

I don't see the contradiction, unless you believe that the grandparent comment was written by an LLM.


wat10000 40 days ago

"飞机" and "airplane" aren't fundamentally different in terms of how they're represented to a computer. Especially for an LLM, where tokenization likely turns each of those into a single token.
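A rough Python sketch of that point: both strings are just sequences of code points and UTF-8 bytes, and a tokenizer that has both words in its vocabulary would map each one to a single ID. The vocabulary, the IDs, and the `toy_tokenize` function below are invented for illustration and do not come from any real model's tokenizer.

```python
words = ["飞机", "airplane"]

# To the computer, both words are just code points encoded as bytes.
for w in words:
    code_points = [hex(ord(ch)) for ch in w]
    print(w, code_points, len(w.encode("utf-8")), "bytes")

# Hypothetical vocabulary: the entries and IDs are made up for this sketch.
TOY_VOCAB = {"飞机": 1001, "airplane": 1002}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match lookup, falling back to raw UTF-8 bytes."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # No vocabulary entry matched: emit the character's bytes as IDs.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


print(toy_tokenize("飞机"))      # one ID for the whole word
print(toy_tokenize("airplane"))  # one ID for the whole word
```

Under this assumed vocabulary, the script the word is written in makes no difference: each word becomes one integer.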
