by pstch 2389 days ago
An interesting - but not surprising - thing about this is that compression algorithms can be more efficient on wider representations of numerically high code points (e.g., for some Korean corpus, using UTF-32 instead of UTF-8 improves LZMA compression by ~10%).
1 comments

How well does that corpus compress with LZMA when using a Korean-specific character encoding (such as EUC-KR)? And what about other combinations of character encodings and compression algorithms?
EUC-KR doesn't improve much with LZMA (2% over UTF-16), but does better with gzip-9 (10% over UTF-16). I haven't studied this extensively; I just ran a few tests while waiting for it to download.
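
For anyone who wants to try other combinations, here is a rough Python sketch of this kind of test, using only the standard library. The sample text is a made-up placeholder rather than the corpus discussed above, and the name `compressed_sizes` is just for illustration, so the numbers it prints will not match the percentages quoted in this thread. Swap in a real corpus to get meaningful results.

```python
import gzip
import lzma

# Hypothetical placeholder text (an assumption, not the corpus from the thread).
# A short repeated sentence compresses unrealistically well; use real data.
SAMPLE = "안녕하세요. 한국어 문장입니다. " * 200

# Encodings to compare. The "-le" variants avoid a byte-order mark.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "euc_kr"]

# Compressors to compare: LZMA with default settings, gzip at level 9.
COMPRESSORS = {
    "lzma": lambda data: lzma.compress(data),
    "gzip-9": lambda data: gzip.compress(data, compresslevel=9),
}


def compressed_sizes(text):
    """Return {(encoding, compressor): compressed size in bytes}."""
    sizes = {}
    for enc in ENCODINGS:
        # Raises UnicodeEncodeError if the text has characters the
        # encoding cannot represent (possible with EUC-KR).
        raw = text.encode(enc)
        for name, compress in COMPRESSORS.items():
            sizes[(enc, name)] = len(compress(raw))
    return sizes


if __name__ == "__main__":
    for (enc, name), size in sorted(compressed_sizes(SAMPLE).items()):
        print(f"{enc:10} {name:7} {size:8d} bytes")
```

Comparing sizes per compressor across encodings (rather than compression ratios) keeps the comparison fair, since the raw size already differs a lot between UTF-8, UTF-16, UTF-32, and EUC-KR.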