| HN Mirror


	by lyonlim 3142 days ago
	Curious how Twitter detects when it's Chinese or Japanese to restrict the length to 140 chars. What if it's a tweet that starts with English, then includes Chinese characters? Is that what the progress indicator is for, to let Twitter detect the language?

3 comments

darkengine 3142 days ago

They've made CJK characters (including fullwidth characters of any kind, with no exception for Latin letters or Arabic numerals) count as 2 characters. In a mixed alphanumeric/CJK tweet, the CJK part simply counts double toward your character limit.

kurthr 3142 days ago

Interesting... of course 70 characters of Mandarin would be the equivalent of over 400 characters of English. I suppose that plus goog-Translate would be another even less readable way to avoid the limit... since Chinese language speakers by and large just use WeChat.

lifthrasiir 3142 days ago

It is a brain-dead solution: any Unicode scalar value not matching /[\u0000-\u10ff\u2000-\u200d\u2010-\u201f\u2032-\u2037]/ doubles the cost. [1] The primary range ends at U+10FF because it conveniently excludes virtually all CJK characters (Hangul starts at U+1100) with relatively low error rates. Yet, it's still brain-dead.

[1] https://twitter.com/FakeUnicode/status/928030981805588480 (used to be /[\u0000-\u10ff]/ during the test period)
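
For anyone who wants to try the rule described above, here is a rough Python sketch of it. It only applies the character class quoted in the comment: a code point inside those ranges costs 1, anything else costs 2. The names `weighted_length` and `fits` are made up for illustration, the default limit of 280 is an assumption inferred from the 140-character CJK figure in the question, and whatever else Twitter does to the text (URL handling, normalization) is not modeled.

```python
import re

# Character class quoted in the comment above: code points in these
# ranges cost 1, everything else costs 2.
SINGLE_WEIGHT = re.compile(
    r"[\u0000-\u10ff\u2000-\u200d\u2010-\u201f\u2032-\u2037]"
)


def weighted_length(text: str) -> int:
    """Sum the per-code-point cost of text under the quoted rule."""
    # Iterating a Python str yields one code point at a time.
    return sum(1 if SINGLE_WEIGHT.fullmatch(ch) else 2 for ch in text)


def fits(text: str, limit: int = 280) -> bool:
    """Check text against a weighted limit (280 is an assumed default)."""
    return weighted_length(text) <= limit


if __name__ == "__main__":
    for sample in ["hello", "日本語", "Hi 世界", "Ａ"]:
        print(repr(sample), weighted_length(sample))
```

Under this sketch a tweet that starts in English and switches to Chinese needs no language detection at all: each character is priced on its own, so the mixed string just sums the two costs.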

akanet 3142 days ago

Try it! I wonder, too.
