| HN Mirror



	by pstch 2389 days ago
	An interesting, but not surprising, thing about this is that compression algorithms can be more efficient on wider representations of numerically high code points (e.g., for one Korean corpus, using UTF-32 instead of UTF-8 improves LZMA compression by ~10%).

1 comments

zzo38computer 2389 days ago

How well does that corpus compress with LZMA when using a Korean-specific character encoding (such as EUC-KR)? And what about other combinations of character encodings and compression algorithms?


pstch 2389 days ago

EUC-KR doesn't improve much with LZMA (2% over UTF-16), but is better with gzip-9 (10% over UTF-16). I haven't studied this extensively; I just ran a few tests while waiting for it to download.
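
For anyone who wants to repeat this kind of comparison, here is a minimal sketch using only the Python standard library. It is not the script used for the numbers above: the function name `compressed_sizes`, the list of encodings, and the sample text are all assumptions made for illustration. The little-endian codec variants are used so that no byte order mark is added to the output. A short repeated sample will not reproduce the percentages quoted in this thread; a real corpus is needed for that. Note also that Python's `euc-kr` codec raises an error on characters outside its repertoire.

```python
import gzip
import lzma


def compressed_sizes(text, encodings=("utf-8", "utf-16-le", "utf-32-le", "euc-kr")):
    """Encode `text` in each encoding and report raw and compressed byte counts.

    Hypothetical helper for illustration only. Encoding errors are not
    suppressed, so text that a codec cannot represent raises UnicodeEncodeError.
    """
    results = {}
    for enc in encodings:
        raw = text.encode(enc)
        results[enc] = {
            "raw": len(raw),
            # LZMA in the .xz container at the highest standard preset.
            "lzma": len(lzma.compress(raw, preset=9)),
            # DEFLATE in the gzip container at level 9 (the "gzip-9" case).
            "gzip-9": len(gzip.compress(raw, compresslevel=9)),
        }
    return results


if __name__ == "__main__":
    # Placeholder sample; replace with the contents of an actual corpus.
    sample = "안녕하세요. 한국어 문장을 압축합니다. " * 200
    for enc, sizes in compressed_sizes(sample).items():
        print(enc, sizes)
```

Comparing the `lzma` and `gzip-9` columns across encodings for the same text gives the kind of numbers discussed here; the `raw` column shows how much of any difference is already present before compression.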
