by byuu 5162 days ago
In UTF-8, not only kanji but also hiragana and katakana (the two syllabaries) encode to three bytes per character. Shift-JIS encodes all three in two bytes, and half-width katakana in one byte per character.
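
For anyone who wants to check those byte counts, here is a minimal sketch using Python's built-in `utf-8` and `shift_jis` codecs. The helper name `byte_lengths` and the four sample characters are my own picks for illustration, not anything from the thread.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, Shift-JIS length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("shift_jis"))


# One sample per script; the half-width katakana is written as an escape
# (U+FF71, half-width "a") so it survives copy and paste.
samples = {
    "kanji": "漢",
    "hiragana": "あ",
    "katakana": "ア",
    "half-width katakana": "\uff71",
}

for label, ch in samples.items():
    utf8_len, sjis_len = byte_lengths(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, Shift-JIS {sjis_len} bytes")
```

This only checks single characters; real pages mix in ASCII, which is one byte in both encodings.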

However, if size is such a concern (e.g., for web transmission), text compression neutralizes the perceived benefit of region-specific encodings.

Shift-JIS's continued popularity has much more to do with change aversion than with technical merit.

3 comments

As I said above, I spoke with several Japanese people who said that some valid characters are not representable in Unicode.

Some details can be found here: http://en.wikipedia.org/wiki/Han_unification

On the web, ASCII (think HTML tags, CSS stylesheets, etc.) is typically a large fraction of CJK pages, so the relative inefficiency of UTF-8 for encoding CJK text is less important.

@ruediger There's nothing wrong with Unicode. UTF-8 sucks because it ends up taking more space.

@byuu No, it doesn't. Try compressing a SJIS text using gzip. Then convert it to UTF-8 and do the same thing. With a "perfect" compressor, there shouldn't be any difference, since the information content is the same; unfortunately, we don't have a perfect compression algorithm that hits the theoretical lower bound.
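
The experiment described above can be sketched roughly as follows, using Python's standard `gzip` module. The sample sentence, the repeat count, and the helper names `sizes` and `compare` are assumptions made up for illustration. A repeated sentence compresses far better than real prose, so treat the printed numbers as a toy result and substitute a real corpus before drawing conclusions.

```python
import gzip

# Hypothetical sample: one sentence repeated 50 times. Every character
# here is representable in Shift-JIS.
SAMPLE = "日本語のテキストは、ひらがなとカタカナと漢字で書かれています。" * 50


def sizes(text, encoding):
    """Return (raw byte length, gzip-compressed byte length) for text."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw))


def compare(text=SAMPLE):
    """Measure the same text under Shift-JIS and UTF-8."""
    return {enc: sizes(text, enc) for enc in ("shift_jis", "utf-8")}


for enc, (raw_len, gz_len) in compare().items():
    print(f"{enc}: raw {raw_len} bytes, gzip {gz_len} bytes")
```

To try it on actual text, pass the contents of a file to `compare()` instead of relying on the default sample.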