| HN Mirror



	by sheept 54 days ago
	UTF-8 is not locale independent. You cannot correctly render multilingual UTF-8 text without also specifying its locale, and some transformations like uppercase/lowercase also depend on the locale.

4 comments

sourcegrift 54 days ago

E.g., some CJK characters render differently depending on whether the locale is mainland China, Taiwan, or Japan. One example is 骨 (from my old notes, so there is a tiny chance this example is incorrect).


cyphar 53 days ago

Yeah, 骨 is one but IMHO the best example is 返 -- it renders differently in every CJK locale.


Joker_vD 54 days ago

> You cannot correctly render multilingual UTF-8 text without also specifying its locale

You can render it pretty well, not perfect, but good enough to actually read it, as opposed to not being able to render it at all or rendering mojibake à la РљСЂР°РєРѕР·СЏР±СЂС‹ instead.
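For readers who have not seen this kind of mojibake, here is a minimal sketch of how it arises, assuming the garbled text above is UTF-8 bytes misread as Windows-1251 (the variable names are illustrative only):

```python
# Sketch: UTF-8 bytes decoded with the wrong legacy code page.
original = "Кракозябры"

# Encode correctly as UTF-8 (two bytes per Cyrillic letter)...
utf8_bytes = original.encode("utf-8")

# ...then misinterpret those bytes as Windows-1251, one character per byte.
garbled = utf8_bytes.decode("cp1251")

# The damage is reversible as long as every byte survived the round trip.
recovered = garbled.encode("cp1251").decode("utf-8")

print(garbled)
print(recovered)
```

Each Cyrillic letter becomes two unrelated characters, which is why the garbled string is twice as long as the original.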


numpad0 54 days ago

At least touching Unicode strings in the wrong locale only mildly corrupts the strings. Plenty of Win32 apps would crash if the system locale were set to UTF-8.


throw1234567891 53 days ago

UTF-8 is a character encoding and therefore it cannot serve as a locale. There is no UTF-8 language, punctuation, date and number formats…


numpad0 53 days ago

I mean, UTF-8 string handling depends on the language of the given bitstream (not necessarily the system's), e.g. Turkish dotted and dotless I, Chinese Hanzi vs. Japanese Kanji at the same code points, etc.
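The Turkish case can be sketched in Python, whose built-in `str.lower()` applies the default, language-independent Unicode mapping. `lower_tr` below is a hypothetical helper written for this illustration, not a standard library function, and it only handles the two Turkish-specific letters:

```python
def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish rules for I and İ."""
    # In Turkish, dotted İ lowercases to i and dotless I lowercases to ı.
    # Map those two letters first, then defer to the default mapping.
    return text.replace("İ", "i").replace("I", "ı").lower()

# Default mapping: capital I becomes dotted i, which is wrong for Turkish text.
default = "ISPARTA".lower()

# Turkish-aware mapping: capital I becomes dotless ı.
turkish = lower_tr("ISPARTA")

# The default mapping of İ (U+0130) yields 'i' plus a combining dot above.
dotted = "İ".lower()

print(default, turkish, len(dotted))
```

The same code points give different correct answers depending on the language of the text, which the bytes alone do not record.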


jech 53 days ago

> UTF-8 is not locale independent.

The encoding itself is locale-independent. Some algorithms (rendering, casing, hyphenation etc.) depend on the locale.

This is unlike the older paradigm, where the encoding itself was dependent on the locale, making things like copy-paste between applications running in different locales problematic.
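A small sketch of that difference, using Latin-1 and Windows-1251 as stand-ins for two locale-specific legacy encodings (the choice of code pages and variable names is an assumption made for illustration):

```python
# Older paradigm: the meaning of a byte depends on the locale's code page.
raw = bytes([0xE9])
as_western = raw.decode("latin-1")   # Western European reading
as_cyrillic = raw.decode("cp1251")   # Cyrillic reading of the same byte

# UTF-8: the mapping between bytes and code points is the same everywhere.
utf8_e_acute = "é".encode("utf-8")
utf8_short_i = "й".encode("utf-8")

print(as_western, as_cyrillic)
print(utf8_e_acute, utf8_short_i)
```

With the legacy code pages, one byte stands for two different letters; with UTF-8, each letter has its own byte sequence and no locale is needed to decode it.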
