Hacker News
by stonekyx 1173 days ago
I think I observed this problem recently too, but with Japanese file names.

Basically, I had a Git repository and didn’t know about `git config core.precomposeUnicode`. When the repository was synchronized to a Linux system, the Linux side would sometimes end up with two files whose names look identical but use different Unicode normalizations. (Because I think ext4 doesn’t normalize Unicode?) That took me about an afternoon to fix.
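To make the "same-looking name, different normalization" issue concrete, here is a minimal Python sketch. The file name and the helper `find_normalization_collisions` are made up for illustration; only `unicodedata.normalize` is a real standard-library call.

```python
import unicodedata

# The same visible name "が.txt", spelled two ways:
# NFC uses one precomposed code point (U+304C),
# NFD uses the base kana (U+304B) plus a combining voiced mark (U+3099).
name_nfc = "\u304c.txt"
name_nfd = "\u304b\u3099.txt"


def find_normalization_collisions(names):
    """Group names that become identical once normalized to NFC.

    Hypothetical helper: returns only the groups with more than one member.
    """
    groups = {}
    for name in names:
        key = unicodedata.normalize("NFC", name)
        groups.setdefault(key, []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Different byte sequences, so a byte-oriented filesystem sees two files.
    print(name_nfc.encode("utf-8"))
    print(name_nfd.encode("utf-8"))
    print(find_normalization_collisions([name_nfc, name_nfd, "a.txt"]))
```

Running something like this over the output of `os.listdir()` would be one way to spot such duplicates in a checkout, though I have not tried it on a large repository.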


> Because I think ext4 doesn’t normalize Unicode?

Yeah, native Linux filesystems don't really know anything about Unicode at their heart. They work on raw bytes, reserving only 0x00 (NUL) and 0x2F ('/'). Anything else goes in a filename as far as the kernel is concerned (including evil stuff like incomplete multibyte sequences or other invalid UTF-8 byte sequences). User space is welcome and encouraged to treat filenames as UTF-8 on modern systems, but nothing strongly enforces that.
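A rough Python sketch of that distinction, under the simplifying assumption that the only rules are the two reserved bytes. It ignores name-length limits and the special "." and ".." entries, and both function names are hypothetical.

```python
def is_acceptable_name_bytes(name: bytes) -> bool:
    """Approximate the byte-level rule: non-empty, no NUL, no '/'.

    Simplified on purpose: length limits and "." / ".." are not checked.
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def is_valid_utf8(name: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    truncated = b"\xe3\x81.txt"  # first two bytes of a three-byte sequence
    for candidate in (b"hello.txt", truncated, b"a/b", b"\xff\xfe"):
        print(candidate, is_acceptable_name_bytes(candidate), is_valid_utf8(candidate))
```

The truncated sequence passes the first check and fails the second, which is the kind of name the kernel would take but a UTF-8-expecting program may choke on.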

What can I do with Japanese characters in Zip files that come out all messed up and Cyrillic when extracted under Linux?
It depends. Japanese filenames are sometimes stored in a legacy, non-Unicode code page rather than UTF-8. There are several.

https://en.wikipedia.org/wiki/Code_page

https://stackoverflow.com/a/45583116

https://github.com/m13253/unzip-iconv

Thanks. `unzip -O Shift_JIS <file>` did the trick.
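If the files have already been extracted with the wrong names, the damage can often be reversed by undoing the wrong decoding step. A minimal Python sketch, assuming the names were Shift_JIS bytes that got decoded as CP437 (or CP866, which would explain the Cyrillic). `repair_name` is a hypothetical helper and the code page pair is a guess you may need to adjust.

```python
def repair_name(garbled: str, wrong: str = "cp437", actual: str = "shift_jis") -> str:
    """Re-encode with the code page that was wrongly applied, then decode
    with the one the archive most likely used.

    Raises UnicodeError if the guess about either code page is wrong.
    """
    return garbled.encode(wrong).decode(actual)


if __name__ == "__main__":
    original = "日本語.txt"
    # Simulate the breakage: Shift_JIS bytes misread as a DOS code page.
    as_cp437 = original.encode("shift_jis").decode("cp437")
    as_cp866 = original.encode("shift_jis").decode("cp866")
    print(as_cp437, "->", repair_name(as_cp437))
    print(as_cp866, "->", repair_name(as_cp866, wrong="cp866"))
```

This only works when the wrong decoding was lossless, so it is worth testing on a copy of the files first.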