Hacker News
by jgalt212 4305 days ago
Didn't read the article, but I did grep it for unicode and utf-8 and found no matches, so this might also be related.

Twitter lets you tweet 140 characters regardless of the bit width of those characters. For Japanese, I think almost all characters take three bytes in utf-8. As such, given the same number of tweets, the bandwidth usage is approx 3X.

Twitter also seems much more useful in ideogram languages as 140 characters = 140 words = an article. In English, 140 characters = a short/medium sized sentence.

5 comments

It was completely and utterly not about that, which I believe actually surprised many of the readers of the article.

I wonder if they lost viewership by making a statement in their headline that people thought they knew the answer to already.

Sorry if this comes off negatively, but I'd like to correct some of the things you said in this comment. Chinese characters are closer to logograms than ideograms. Chinese characters are more than a representation of an idea.

The vast majority of words in Chinese and Japanese are not single characters. In fact, a very large portion of the characters cannot be used by themselves, at least in modern Japanese. Japanese is also heavily reliant on two other writing systems which are far less space-efficient. A single character of either hiragana or katakana represents only a single syllable of sound (such as か(ka),ぽ(po),し(shi),etc.). Unless your tweets were just very long noun phrases without any grammar pertaining only to things that can be written with Chinese characters (which means no modern foreign words), less efficient writing systems would be needed.

So, while Japanese may be slightly more compact due to the lack of spaces and the ability to assign a large number of sounds to a single Chinese character (such as 承る=うけたまわる(uketamawaru)), it's still a synthetic language that can have very large conjugations (which have to be written in hiragana) and has a very unfavorable ratio of meaning to syllables (partially mitigated by the use of Chinese characters).

Japanese is often encoded in Shift_JIS, which is much more compact than UTF-8 (for Japanese text). Most browsers default to that encoding on Japanese OSes, so there probably isn't a real bandwidth issue, depending on the ratio of browser to dedicated client usage.
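For anyone who wants to check the size difference, here is a minimal sketch using Python's built-in codecs. The sample strings and the helper name `encoded_sizes` are made up for illustration, and this only measures raw encoded size; it says nothing about what Twitter actually sends over the wire.

```python
# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "hiragana": "か",
    "kanji + okurigana": "承る",
    "ascii": "a",
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, Shift_JIS bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("shift_jis")),
    )


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        points, utf8_bytes, sjis_bytes = encoded_sizes(text)
        print(f"{label}: {points} code point(s), "
              f"{utf8_bytes} UTF-8 byte(s), {sjis_bytes} Shift_JIS byte(s)")
```

Kana and common kanji come out at three bytes each in UTF-8 and two bytes each in Shift_JIS, while plain ASCII is one byte in both.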

As for ideograms, your statement is not correct. Kanjis allow for better content/character ratios, certainly, but one kanji is very often not one word - the word foreigner, for example, uses 3 ideograms.

On top of that, Japanese is not written solely with kanjis (as opposed to Chinese, for example). It also uses katakanas and hiraganas, which stand for phonemes. This is more often the case on social networks where a lot of western words are used - western words are almost systematically written with katakanas.

Kanas are still more "efficient" than alphabets, but to reuse my previous example, foreigner is written using 5 kanas instead of 3 kanjis.

I know it's such a tiny point, and very much off topic, but please don't "pluralize" words such as Kanji, Katakana, etc. As someone who has spent a lot of time learning Japanese, and who knows that Japanese words don't change between singular/plural (with certain exceptions, such as attaching "-tachi" to a word), seeing you write "Kanjis" or "Katakanas" so many times in a row really bothers me.

I'll go back to my corner and leave you alone now.

Well, he's speaking English, not Japanese. Do you refer to multiple pizzas as pizze?
It's more like the plural of 'deer' being 'deer'. I don't think I've ever heard anyone attach an 's' to 'kanji' for pluralization when speaking English. Like how it might be odd to say 'sushis' or 'wasabis'. Whenever I need to stress plurality I would say 'kanji characters' or something like that. Oxford English Dictionary lists 'sushi', 'kanji', 'shinkansen', 'katakana' all as being mass nouns or having the plural form the same as the singular. The exception in the words I looked up was 'tsunami' which may be pluralized as 'tsunami' or 'tsunamis'.
Most of those are the sorts of nouns that wouldn't normally be pluralized in English. "Sushi" is like "rice" — specifies what the roll is made of rather than the roll itself. We don't pluralize the name of a rail system like Shinkansen because there is only one of it (similarly, "the L" but not "the Ls"). But there are many Japanese loanwords that are commonly pluralized differently in English. For example, futons, tycoons, typhoons, tatamis, ninjas and kimonos.

It's ambiguous whether "kanji" is a mass noun referring to the character set as a whole or a singular noun referring to a character in the set. I think it's both. So it seems hard to blame someone for being unclear on the matter.

Eh, English has a long history of pedants insisting that the original pluralization of loanwords be used. If people are going to push for indices instead of indexes, octopodes instead of octopuses, and rooves instead of roofs, there's no reason not to use the Japanese pluralization of loanwords when appropriate.
Actually, you're wrong. The plural of "kanji" in English is "kanji": http://www.merriam-webster.com/dictionary/kanji
That was not the basis of FreezerburnV's complaint, which was based on how it's done in Japanese, not English. So I don't believe SunShiranui's point is wrong. The fact that the plural of "kanji" is "kanji" in English does not establish a blanket rule that every word must be pluralized according to its language of origin. (And that is good, because pluralizing "cherry" would be a nightmare! It's an over-singularized form of the already-singular French "cherise.")
I apologise, I was not aware you could not pluralise these words. My only excuse is that English is not my first language.

I'd go back and edit my comment, but that'd just make yours seem weird...

English has many anomalies and plural forms are often strange. That said, in the texts I've read, they did not add an 's' to kana, kanji, katakana or hiragana, but used them as their own plural forms.

I wouldn't have guessed that you were not a native speaker, though. You write well.

Just for information, be aware that you should not write 'softwares' or 'informations' either. Those aren't words.

So is Twitter sending Shift_JIS encoded responses back? That seems unlikely, given that it seems easier to engineer accepting anything but always sending utf-8 back, but definitely interesting if that's the case.

Also, why downvote? Even if it was largely wrong, the discussion engendered was a net positive to HN.

I think Twitter generates answers based on the client's Accept-Charset header. Well, I assumed that. Come to think of it though, it does seem unlikely and you make a good point.

I did not downvote, or if I did I did not mean to. OP's comment certainly did not deserve it.

No, the 140 characters don't mean 140 bytes [0]. A quick skim seems to indicate that it's much more accepting than that: Every character is normalized to a preset format, such that combinations like é are represented in a single codepoint (and not "e plus diacritic" which would be two), and then you count the number of codepoints.

So, contrary to popular belief, the languages that could be discriminated against seem to not be Chinese or Japanese, but languages with possible combinations on each character, such as European languages.
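A rough sketch of that counting rule, assuming the "preset format" is Unicode NFC (my assumption, the comment doesn't name it) and using a made-up helper called `tweet_length`. This is not Twitter's actual code, just the normalize-then-count idea:

```python
import unicodedata


def tweet_length(text):
    """Count code points after NFC normalization (assumed counting rule)."""
    return len(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    composed = "caf\u00e9"     # é stored as a single code point
    decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT
    no_precomposed = "q\u0301" # q + combining acute, no single-code-point form

    for label, text in [
        ("composed", composed),
        ("decomposed", decomposed),
        ("no precomposed form", no_precomposed),
        ("Japanese", "日本語"),
    ]:
        print(f"{label}: raw={len(text)} counted={tweet_length(text)}")
```

Both spellings of "café" count as 4, and each kana or kanji counts as 1 no matter how many bytes it takes. A base letter plus a combining mark with no precomposed form still counts as 2, which is where the penalty for combining-heavy scripts would come from.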

I thought it would be about that too, but instead it was about punctuality.