by DougWebb 4704 days ago
I found http://en.wikipedia.org/wiki/Eastern_Arabic_numerals, which shows examples of the differences in those numerals, but it looks like the different representations have different Unicode code points, so there's no need for the lang attribute. (The page uses lang attributes, but if you take them off there's no difference in the display.)

You probably need to know the language to do things like sorting, comparison, regex, etc. But if you're just storing and displaying user-entered strings and your software has no need to understand the meaning of the strings, I think it's enough to do what the parent says.

1 comments

Not quite. The Wikipedia article shows the difference between U+0660 .. U+0669 (Arabic-Indic digits) on the top row and U+06F0 .. U+06F9 (Eastern Arabic-Indic digits) on the bottom row.
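To make the two rows concrete, here is a minimal Python sketch that lists both ranges side by side using the standard unicodedata module. The helper name describe is my own, purely for illustration. Note that the Unicode character names call the second range "EXTENDED ARABIC-INDIC DIGIT", which is the same block referred to above as Eastern Arabic-Indic.

```python
import unicodedata

# The two digit ranges mentioned above (end of range is exclusive).
ARABIC_INDIC = [chr(cp) for cp in range(0x0660, 0x066A)]
EASTERN_ARABIC_INDIC = [chr(cp) for cp in range(0x06F0, 0x06FA)]

def describe(ch):
    """Hypothetical helper: code point, Unicode name, and digit value."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} = {unicodedata.digit(ch)}"

# Each pair has the same numeric value but a different code point.
for a, e in zip(ARABIC_INDIC, EASTERN_ARABIC_INDIC):
    print(describe(a), "|", describe(e))
```

So the top-row vs. bottom-row difference really is encoded in the text itself, and no language information is needed to tell those two apart.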

But what I'm talking about are the different glyphs used to represent the bottom row (U+06F0 .. U+06F9) depending on whether the text is in Persian, Sindhi, or Urdu. See http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf, table 8-2.
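Here is a rough sketch of what that means for markup: the same code points, wrapped in spans that differ only in their lang attribute. The function name tagged_spans and the choice of the BCP 47 tags fa, sd, and ur are my own for illustration. Whether you actually see different glyphs depends on the browser and on the installed fonts having language-specific forms, so treat this as a test page generator, not a guarantee.

```python
from html import escape

# Same code points for every language: U+06F0 .. U+06F9.
EASTERN_DIGITS = "".join(chr(cp) for cp in range(0x06F0, 0x06FA))

# Assumed BCP 47 language tags for the three languages discussed.
LANGS = {"fa": "Persian", "sd": "Sindhi", "ur": "Urdu"}

def tagged_spans(text, langs):
    """Hypothetical helper: one paragraph per language, identical text,
    differing only in the lang attribute on the span."""
    lines = []
    for tag, name in langs.items():
        lines.append(
            f'<p>{escape(name)}: <span lang="{escape(tag)}">{escape(text)}</span></p>'
        )
    return "\n".join(lines)

html_fragment = tagged_spans(EASTERN_DIGITS, LANGS)
print(html_fragment)
```

If you strip the lang attributes from the output, the three lines become byte-for-byte identical text, which is exactly why the code points alone can't carry the distinction.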

There is also the issue I mentioned about Chinese vs. Japanese glyphs for the same coded character, which is at least as important in practice.

This is an issue with CJK characters and probably just one more reason why UTF-8 adoption has been slow where JIS is good enough.