| HN Mirror

by keithwinstein 4704 days ago

Just to be a bit pedantic, unfortunately you don't get "proper i18n support" just by putting everything in UTF-8.

Unicode lets you represent lots of abstract characters, from different languages and societies, in one character set. That doesn't quite tell you how to render the characters. For that, you need to know what language the text is in. Unicode wants you to provide that information out-of-band, e.g. in an HTML "lang" attribute, which the renderer can use to paint the proper glyphs.

For example, the Eastern Arabic-Indic digits 4 through 7 (۴ U+06F4 .. ۷ U+06F7) have different glyphs in Persian, Sindhi, and Urdu, even though each digit is a single code point. And a character like 直 (U+76F4) has Chinese and Japanese glyphs that may not be mutually recognizable.

Bottom line: if you want an internationalized system that can store and render multilingual text, storing the text in Unicode is a good start, but you will need to store additional info (like the language) to be able to properly render the text.
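A minimal sketch of that bottom line, assuming the text ends up in HTML: keep a language tag next to each string and emit it as a "lang" attribute. The `TaggedText` class and `to_html` helper are hypothetical names made up for illustration, and the tags are ordinary BCP 47 language tags.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class TaggedText:
    """A string stored together with its language (hypothetical record type)."""
    text: str
    lang: str  # BCP 47 tag such as "ja", "zh-Hans", "fa", "ur"


def to_html(item: TaggedText) -> str:
    # The lang attribute is the out-of-band hint a renderer can use
    # when choosing glyphs for the same code point.
    return f'<span lang="{escape(item.lang, quote=True)}">{escape(item.text)}</span>'


# Same code point (U+76F4), two languages: the stored text is identical,
# only the tag differs.
samples = [TaggedText("\u76f4", "ja"), TaggedText("\u76f4", "zh-Hans")]
for sample in samples:
    print(to_html(sample))
```

Whether the two spans actually look different still depends on the fonts the renderer has available; the tag only makes the right choice possible.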

DougWebb 4704 days ago

I found http://en.wikipedia.org/wiki/Eastern_Arabic_numerals which shows examples of the differences in those numerals, but it looks like the different representations have different Unicode code points. So, there's no need for the lang attribute. (The page uses them, but if you take them off there's no difference in the display.)

You probably need to know the language to do things like sorting, comparison, regex, etc. But if you're just storing and displaying user-entered strings and your software has no need to understand the meaning of the strings, I think it's enough to do what the parent says.
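To illustrate the sorting point, here is a rough sketch. Python's default sort compares code points, which puts "ä" after "z"; that happens to resemble Swedish ordering, while German dictionaries file "ä" alongside "a". The `strip_marks` key below is a toy approximation invented for this example, not real collation (which would need locale data or a library such as ICU).

```python
import unicodedata

words = ["zebra", "\u00e4pfel", "apple"]  # "\u00e4" is the precomposed letter ä

# Default ordering: plain code point comparison, no language knowledge.
by_code_point = sorted(words)


def strip_marks(s: str) -> str:
    """Toy sort key: decompose, then drop combining marks, so ä sorts like a."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Crude German-style ordering, for illustration only.
german_like = sorted(words, key=strip_marks)

print(by_code_point)
print(german_like)
```

The same list comes out in two different orders, and which one is "right" depends on the language of the text, which the strings themselves do not record.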

keithwinstein 4704 days ago

Not quite. The Wikipedia article shows the difference between U+0660 .. U+0669 (Arabic-Indic digits) on the top row and U+06F0 .. U+06F9 (Eastern Arabic-Indic digits) on the bottom row.

But what I'm talking about are the different glyphs used to represent the bottom row (U+06F0 .. U+06F9) depending on whether the text is in Persian, Sindhi, or Urdu. See http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf, table 8-2.
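A small sketch of the distinction, using Python's standard unicodedata module. The two rows really are separate code points with separate names, but within the bottom row a Persian, Sindhi, and Urdu "four" are all the same code point, so the string alone cannot say which glyph is wanted. The `describe` helper and the `fours` dictionary are illustrative names only.

```python
import unicodedata


def describe(ch: str):
    """Return (code point, Unicode character name, digit value) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.digit(ch))


# Top row vs. bottom row: distinct code points, same numeric value.
print(describe("\u0664"))
print(describe("\u06F4"))

# Bottom row only: the digit four as it would be stored in text of each
# language (keys are ISO 639-1 codes for Persian, Sindhi, and Urdu).
fours = {"fa": "\u06F4", "sd": "\u06F4", "ur": "\u06F4"}

# All three languages share one code point; the glyph difference has to
# come from language information kept outside the string.
print(len(set(fours.values())))
```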

There is also the issue I mentioned about Chinese vs. Japanese glyphs for the same coded character, which is at least as important in practice.

ics 4704 days ago

This glyph ambiguity is a real issue with CJK characters, and probably just one more reason why UTF-8 adoption has been slow in places where JIS is good enough.
