Hacker News
by lyonlim 3142 days ago
Curious how Twitter detects when it's Chinese or Japanese to restrict the length to 140 chars. What if it's a tweet that starts with English, then includes Chinese characters? Is that what the progress indicator is for, to let Twitter detect the language?
3 comments

They've made CJK characters (including fullwidth characters of any kind, with no exception for fullwidth Latin letters or Arabic numerals) count as 2 characters. In a mixed alphanumeric/CJK tweet, the CJK part simply counts double toward your character limit.
Interesting... of course 70 characters of Mandarin would be the equivalent of over 400 characters of English. I suppose that plus Google Translate would be another, even less readable, way to avoid the limit... though Chinese speakers by and large just use WeChat.
It is a brain-dead solution: any Unicode scalar value that does not match /[\u0000-\u10ff\u2000-\u200d\u2010-\u201f\u2032-\u2037]/ counts double. [1] The primary range ends at U+10FF because that conveniently excludes virtually all CJK characters (Hangul starts at U+1100) with a relatively low error rate. Yet, it's still brain-dead.

[1] https://twitter.com/FakeUnicode/status/928030981805588480 (used to be /[\u0000-\u10ff]/ during the test period)
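
To make the rule concrete, here is a minimal Python sketch of the weighting as described by that regex. The function names and the 280-unit budget are my own assumptions for illustration (280 units would give the 140-character CJK limit mentioned in the question). This is not Twitter's actual code, and it ignores anything else the real counter may do, such as URL shortening or Unicode normalization.

```python
import re

# Code points that cost 1 unit, taken from the regex quoted above.
# Everything outside these ranges costs 2 units.
SINGLE_WEIGHT = re.compile(
    r"[\u0000-\u10ff\u2000-\u200d\u2010-\u201f\u2032-\u2037]"
)

# Assumed budget: 280 units, i.e. 280 "light" or 140 "heavy" characters.
MAX_WEIGHTED_LENGTH = 280


def weighted_length(text: str) -> int:
    """Sum the per-code-point cost; no language detection is involved."""
    return sum(1 if SINGLE_WEIGHT.fullmatch(ch) else 2 for ch in text)


def fits(text: str, limit: int = MAX_WEIGHTED_LENGTH) -> bool:
    """Return True if the text stays within the weighted budget."""
    return weighted_length(text) <= limit


if __name__ == "__main__":
    # Mixed text: 3 light code points ("hi ") plus 2 heavy ones.
    print(weighted_length("hi \u65e5\u672c"))
```

So a tweet that starts in English and switches to Chinese needs no special handling: each code point is weighed on its own, and the mixed example above comes out at 3 + 2×2 = 7 units.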

Try it! I wonder, too.