| HN Mirror



	by stonekyx 1219 days ago
	I think I ran into this problem recently too, but with Japanese file names. Basically, I had a Git repository and didn’t know about `git config core.precomposeUnicode`. When the repository was synchronized to a Linux system, the Linux side would sometimes end up with two files whose names looked the same but used different normalizations. (Because I think ext4 doesn’t normalize Unicode?) That took me about an afternoon to fix.
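	To illustrate how two identical-looking names can coexist, here is a minimal Python sketch. The helper `find_normalization_collisions` and the sample file names are hypothetical; it only assumes that one name arrived precomposed (NFC) and the other decomposed (NFD).

```python
import unicodedata
from collections import defaultdict

# Two spellings of the same visible name (sample data, not from the thread).
composed = "\u304c.txt"          # "ga" as a single code point (NFC form)
decomposed = "\u304b\u3099.txt"  # "ka" + combining voiced sound mark (NFD form)


def find_normalization_collisions(names):
    """Group names that become identical once normalized to NFC."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only groups where more than one distinct spelling exists.
    return {key: found for key, found in groups.items() if len(set(found)) > 1}


collisions = find_normalization_collisions([composed, decomposed, "notes.txt"])
```

	The two strings compare unequal and encode to different UTF-8 bytes, so a filesystem that compares raw bytes stores them as separate files. Passing the output of `os.listdir()` to the helper would list such pairs in a checkout.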

2 comments

monocasa 1219 days ago

> Because I think ext4 doesn’t normalize Unicode?

Yeah, native Linux filesystems don't really know anything about Unicode at all at their heart. They work on raw bytes, reserving only 0x00 (NUL) and 0x2F ('/'). Anything else goes in a filename as far as the kernel is concerned (including evil stuff like incomplete multibyte sequences or other invalid UTF-8 byte sequences). User space is welcome, and encouraged, to treat filenames as UTF-8 on modern systems, but that's not really enforced anywhere strongly.
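A rough sketch of that rule in Python, for illustration only. The function name `is_acceptable_name` is made up here, and it checks just the two reserved bytes mentioned above, ignoring length limits and the special entries `.` and `..`.

```python
def is_acceptable_name(name: bytes) -> bool:
    """Approximate the kernel's view: any non-empty byte string without NUL or '/'."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# An incomplete multibyte sequence: 0xE9 starts a 3-byte UTF-8 character
# but nothing follows it. Still acceptable as a file name.
raw = b"caf\xe9"

try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# User space can still carry such names around losslessly: the
# "surrogateescape" error handler maps each undecodable byte to a lone
# surrogate and restores the original byte when encoding again.
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

This is the same error handler Python uses for file names on POSIX systems, which is why `os.listdir()` can return names that are not valid UTF-8 without raising.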


totetsu 1219 days ago

What can I do about Japanese characters in ZIP file names that come out all messed up and Cyrillic when extracted under Linux?


actionfromafar 1219 days ago

It depends. Japanese filenames are sometimes stored in a legacy (non-Unicode) code page, and there are several of them.

https://en.wikipedia.org/wiki/Code_page

https://stackoverflow.com/a/45583116

https://github.com/m13253/unzip-iconv


totetsu 1219 days ago

Thanks. `unzip -O Shift_JIS <file>` did the trick.
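The same re-decoding can be sketched in Python, as an alternative where that `unzip` option is not available. The helpers `fix_zip_name` and `list_names` are hypothetical names, and Shift_JIS is an assumption about the archive; the sketch relies on `zipfile` decoding names that lack the UTF-8 flag as CP437.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def fix_zip_name(name: str, encoding: str = "shift_jis") -> str:
    """Undo a CP437 mis-decoding and reinterpret the bytes in `encoding`."""
    try:
        return name.encode("cp437").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a CP437 mis-decoding of `encoding`; leave the name alone.
        return name


def list_names(zip_file, encoding: str = "shift_jis"):
    """Return member names, repairing only those without the UTF-8 flag."""
    with zipfile.ZipFile(zip_file) as archive:
        return [
            info.filename
            if info.flag_bits & UTF8_FLAG
            else fix_zip_name(info.filename, encoding)
            for info in archive.infolist()
        ]
```

Plain ASCII names pass through unchanged, since ASCII bytes mean the same thing in both encodings. If the archive used a different code page, pass its name as `encoding` instead.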
