by snnn 783 days ago
Man, if English were the only human language in the world, who would need UTF-8? The other encodings exist because they are more efficient for other languages, especially Chinese, Japanese, and Korean. For those, UTF-8 takes 50% more space than the alternatives (typically 3 bytes per character instead of 2). Too bad modern Linux systems only support UTF-8 locales.
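For anyone who wants to check the 50% figure, here is a minimal Python sketch. The sample strings and the `size_ratio` helper are my own illustrations, not anything from the thread, and the ratio only holds for text made up purely of common CJK characters.

```python
# Compare encoded sizes of short CJK samples in UTF-8 and a legacy encoding.
# The sample strings and the helper name are illustrative assumptions.
samples = {
    "gbk": "中文编码",      # Chinese
    "shift_jis": "日本語",  # Japanese
    "euc_kr": "한국어",     # Korean
}


def size_ratio(text: str, legacy: str) -> float:
    """Return UTF-8 size divided by the size in the given legacy encoding."""
    return len(text.encode("utf-8")) / len(text.encode(legacy))


for legacy, text in samples.items():
    legacy_size = len(text.encode(legacy))
    utf8_size = len(text.encode("utf-8"))
    print(f"{legacy}: {legacy_size} bytes, utf-8: {utf8_size} bytes, "
          f"ratio {size_ratio(text, legacy):.2f}")
```

Any ASCII mixed into the text (markup, digits, spaces) takes one byte in both encodings, so real documents usually show a smaller gap than pure CJK samples do.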
2 comments

> Too bad modern Linux systems only support UTF-8 locales.

Do they? On my system:

    $ grep _ /etc/locale.gen | grep -v UTF-8 | wc -l
    183
That's 183 non-UTF-8 locales that are available on my system. OK, I don't have any non-UTF-8 locales currently configured for use, but I don't have to install anything extra for them to be available. Just uncomment some configuration lines and re-run `locale-gen`.

https://manpages.debian.org/bookworm/locales/locale-gen.8.en...

But the reality is that most glibc functions, like `dirname`, cannot handle non-UTF-8 encodings, because some encodings (like GBK) reuse ASCII byte values inside multi-byte characters. That means when you search for an ASCII char (like '\') in a char array, you may accidentally hit one half of a non-English character. Therefore, people in Asia usually do not use the non-UTF-8 locales.
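A rough way to see the overlap, sketched in Python rather than C: scan the Basic Multilingual Plane for characters whose GBK encoding contains a given ASCII byte value. The helper name `gbk_chars_containing` is hypothetical, and this relies on Python's built-in `gbk` codec rather than on anything glibc does.

```python
def gbk_chars_containing(byte_value: int) -> list[str]:
    """Return non-ASCII BMP characters whose GBK encoding contains byte_value."""
    hits = []
    for code_point in range(0x80, 0x10000):
        char = chr(code_point)
        try:
            encoded = char.encode("gbk")
        except UnicodeError:
            continue  # not representable in GBK (includes lone surrogates)
        if byte_value in encoded:
            hits.append(char)
    return hits


with_backslash = gbk_chars_containing(0x5C)  # byte value of '\'
with_slash = gbk_chars_containing(0x2F)      # byte value of '/'

sample = with_backslash[0]
raw = sample.encode("gbk")
print(f"{len(with_backslash)} characters contain byte 0x5C, "
      f"e.g. {sample!r} -> {raw.hex()}")
print(f"{len(with_slash)} characters contain byte 0x2F")
# A byte-wise search for '\' matches the second byte of the sample character.
print("naive search for backslash hits at byte offset", raw.find(b"\\"))
```

With this codec the scan finds characters containing 0x5C but none containing 0x2F, which matters for the '/' versus '\' distinction.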
Why would you search for an ASCII char like '\', in a char array containing non-ASCII-based text, on a system with a non-ASCII-based locale?
Because that's how "dirname(3)" is implemented in glibc, except it searches for '/' instead of '\'. All character encodings share the same code here.
But the byte '/' can never be part of any filename/dirname under a UNIX filesystem. Which kinda sucks generally for anyone wanting to use a charset like that, but doesn't it also mean that should never be a problem for `dirname()`?

I'm struggling to imagine how this failure would manifest. Can you give an example of how dirname() would fail? What combination of existing file/directory name, and usage of that function, would not work as expected?

Edit: I'm also a bit confused how this counts as being a problem for "modern Linux systems" - wouldn't it have always been a problem for all Unix-based OSs?
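To make the question concrete, here is a small Python sketch of a byte-wise split. `naive_dirname` is a hypothetical stand-in that cuts at the last separator byte; it is not glibc's implementation. It assumes the byte pair 0x81 0x5C is a valid GBK character, which the code checks by decoding it.

```python
def naive_dirname(path: bytes, sep: bytes) -> bytes:
    """Hypothetical byte-wise dirname: keep everything before the last sep."""
    index = path.rfind(sep)
    return path[:index] if index > 0 else path[: index + 1]


# A GBK-encoded name whose second byte equals the ASCII backslash (0x5C).
name = b"\x81\x5c"
name.decode("gbk")  # would raise UnicodeDecodeError if this were not valid GBK

# Splitting on '/' (0x2F): GBK trail bytes start at 0x40, so no false match.
unix_path = b"/data/" + name + b"/file.txt"
unix_dir = naive_dirname(unix_path, b"/")
print(unix_dir == b"/data/" + name)

# Splitting on '\' (0x5C): the last match is the trail byte of the character,
# so the result ends with a dangling lead byte.
dos_path = b"data\\" + name + b"file.txt"
dos_dir = naive_dirname(dos_path, b"\\")
print(dos_dir)
```

Under these assumptions the '/' split comes out intact and only the '\' split cuts a character in half, so the failure described for GBK would apply to backslash-separated paths rather than to UNIX ones.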

The other encodings mostly exist for historical reasons; efficiency is just not a huge factor in 2024.