Hacker News
by lukeschlather 1172 days ago
I would want to see some data on tokenization for some real-world examples. "Je voudrais une pizza" actually translates more directly to "I would like a pizza", which is 5 tokens. But I also think there's some danger here that the examples are cherry-picked. Spanish is a lot more dense than English or French and might tokenize better. (I see "quiero pizza" is 4 tokens, which seems like the right number of tokens to me: "quiero" actually contains "I want <present tense>".) You could argue it's 2 or 3 tokens, but 4 seems preferable.

In French or Spanish, letters with diacritics are logically single characters. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat e.g. "Je l'ai aimé" (a pathological example) as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as recognizing a complexity difference between French and English writing.
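To make the diacritics point concrete, here is a minimal sketch using only the Python standard library. It does not run any real tokenizer; it only shows that the same accented word can be stored with the accent fused into the letter (NFC) or as a separate combining mark (NFD), which is one way a letter and its diacritic could end up in different tokens. The helper name `utf8_len` is made up for illustration.

```python
import unicodedata

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

word = "aimé"

# NFC keeps "é" as one code point; NFD splits it into "e" plus a combining accent.
composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

for label, form in (("NFC", composed), ("NFD", decomposed)):
    print(label, len(form), "code points,", utf8_len(form), "bytes")
```

Both forms render identically, so whether the accent can be split off depends on how the input was normalized before tokenization.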

But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and like it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
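A rough illustration of the byte-level concern, again assuming plain Python and no actual tokenizer: the sketch below counts UTF-8 bytes per character for a few sample words (chosen here for illustration, not taken from any benchmark) and shows that cutting a multi-byte character in the middle leaves a fragment that is not valid text on its own. The function names are hypothetical.

```python
def bytes_per_char(text: str) -> list[tuple[str, int]]:
    """Pair each character with its UTF-8 encoded length in bytes."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Illustrative sample words, one per script.
samples = {"English": "pizza", "Russian": "пицца", "Japanese": "ピザ"}

for name, text in samples.items():
    print(name, bytes_per_char(text))

# A token boundary that falls inside a character yields undecodable pieces.
raw = "ピ".encode("utf-8")
print(len(raw), is_valid_utf8(raw), is_valid_utf8(raw[:1]))
```

Latin letters take one byte each, while the Cyrillic and Japanese characters here take two and three, so a vocabulary built over bytes has more places to cut inside a character in those scripts.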

3 comments

> Spanish is a lot more dense than English or French and might tokenize better.

I'm no linguist, so I apologize if I'm misinterpreting this statement. My impression has always been that Spanish is less dense than English, only because in almost all cases, the Spanish version of product instructions is wordier. Look at the back of a shampoo bottle[0] and notice that the Spanish version is either longer, or a smaller font, to fit it all.

[0] https://i.postimg.cc/xd2X5WJN/Ghub-Fo-N11u8jz-Pjj-RDt-W-CGA9...

Instruction manuals are going to be translated, and they're hopefully verbose so as to be explicit.

One area where Spanish is more dense is verb forms, because it retains most of the inflected verbs of Latin, whereas English has lost or merged together a lot of the historical Indo-European inflections. Speaking intuitively, I think it, like most Latin languages, tends to be a bit more verbose with noun phrases.

Another way to measure this is speaking rate. What I remember from linguistics courses is 1) that while different cultures seem to speak at different average speeds, the information content transferred per second of speech seems to be remarkably consistent across languages; and 2) people speak Spanish more quickly than they speak English.
It's probably not a good idea to judge the density of a language by product instructions that are probably a minimally workable translation into the language.
I just found this tiktokenizer project on GitHub that might be of help to you: https://tiktokenizer.vercel.app/
OpenAI released an official tokenizer recently:

https://platform.openai.com/tokenizer