

by svennek 3275 days ago

If you see the converting as hoop-jumping then I agree with you.

If you do recognize that there is such a thing called text, which humans understand, and that for example the text "hello" has five "letters", then the fact that the text might be represented in five different (and equally valid) ways with different numbers of bytes is irrelevant to the "human user"...

Hence the sharp distinction between text and bytes makes sense to most humans.

The Danish word "rødgrød" for example is seven letters (i.e. every person would say it is 7 letters long), but a computer would call it 7, 7, 9 or 16 bytes (in cp865 (Nordic), latin1 ("Western European"), utf8 and utf16 respectively, where the utf16 count includes a 2-byte byte-order mark; without it the count is 14).
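To make those counts concrete, here is a small Python sketch. It assumes Python 3 and its built-in codec names (`cp865`, `latin-1`, `utf-8`, `utf-16`); note that Python's plain `utf-16` codec prepends a byte-order mark, which is where the extra two bytes come from.

```python
word = "rødgrød"

# len() on a str counts code points, which here matches the seven
# letters a person would count.
letters = len(word)

# The byte count depends entirely on the chosen encoding.
# Python's "utf-16" codec adds a 2-byte byte-order mark up front.
byte_lengths = {
    enc: len(word.encode(enc))
    for enc in ("cp865", "latin-1", "utf-8", "utf-16")
}

print(letters)       # 7
print(byte_lengths)  # {'cp865': 7, 'latin-1': 7, 'utf-8': 9, 'utf-16': 16}
```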

Unless you work in the ideal "text" mode, getting the correct length of the text is not that trivial. Equally, the byte pairs 2-3 and 7-8 in utf8 must not be split (as either half has no meaning on its own). Hence (in utf8) a function "give me the next letter" will return 1,2,1,1,1,2,1 bytes in its successive invocations.
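A rough sketch of such a "give me the next letter" walk over utf8 bytes, in Python. The function name `utf8_letter_sizes` is made up for illustration; it only looks at lead bytes, does not validate continuation bytes, and counts code points rather than what a human might consider a letter in every script.

```python
def utf8_letter_sizes(data):
    """Return the byte length of each UTF-8 encoded code point in data."""
    sizes = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single-byte (ASCII)
            size = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
            size = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
            size = 3
        elif lead >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
            size = 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        sizes.append(size)
        i += size
    return sizes


encoded = "rødgrød".encode("utf-8")
sizes = utf8_letter_sizes(encoded)
print(sizes)  # [1, 2, 1, 1, 1, 2, 1]

# Cutting between bytes 2 and 3 leaves half a letter, which cannot be decoded.
try:
    encoded[:2].decode("utf-8")
    split_decodes = True
except UnicodeDecodeError:
    split_decodes = False
print(split_decodes)  # False
```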

Also casefolding (upper/lower case) is hard when working in bytes (some glyphs might not even have one or the other)..
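A quick Python illustration of the casefolding point, assuming Python 3 semantics: `bytes.upper()` only knows about ASCII letters, while `str.upper()` works on text. The "ß" case is an extra example not taken from the word above, added to show a letter without a single-character uppercase form in Python's mapping.

```python
text = "rødgrød"
raw = text.encode("latin-1")

# Working on text: every letter is uppercased.
upper_text = text.upper()

# Working on bytes: only the ASCII letters change, the ø byte (0xf8) is left alone.
upper_raw = raw.upper()

print(upper_text)                    # RØDGRØD
print(upper_raw)                     # b'R\xf8DGR\xf8D'
print(upper_raw.decode("latin-1"))   # RøDGRøD

# Some letters have no one-to-one counterpart in the other case.
sharp_s_upper = "ß".upper()
print(sharp_s_upper)                 # SS
```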

I am unable to say if Unicode fits all languages in the world as needed, but it is much better than byte-wrangling if you have multiple possible encodings at once..

EDIT: for reference the bytes of "rødgrød" are

cp865 : b'r\x9bdgr\x9bd'

latin1 : b'r\xf8dgr\xf8d'

utf8 : b'r\xc3\xb8dgr\xc3\xb8d'

utf16le: b'r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d\x00'

utf16be: b'\x00r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d'

(and it cannot be written in ASCII ("C"))
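For anyone who wants to check the list above, this Python 3 sketch compares each byte string against what the built-in codecs produce and confirms that the ASCII codec refuses the word. The codec names (`utf-16-le`, `utf-16-be`, and so on) are Python's spellings; the explicit little/big-endian variants do not add a byte-order mark.

```python
word = "rødgrød"

# Byte strings as listed above, keyed by Python's codec names.
expected = {
    "cp865": b"r\x9bdgr\x9bd",
    "latin-1": b"r\xf8dgr\xf8d",
    "utf-8": b"r\xc3\xb8dgr\xc3\xb8d",
    "utf-16-le": b"r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d\x00",
    "utf-16-be": b"\x00r\x00\xf8\x00d\x00g\x00r\x00\xf8\x00d",
}

# True for every encoding whose output matches the listed bytes.
matches = {enc: word.encode(enc) == raw for enc, raw in expected.items()}
print(matches)

# ASCII has no code for "ø", so encoding fails instead of producing bytes.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```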

Anyone saying that bytes and text are the same is nuts...