	by Latty 1638 days ago
	Unix filenames are just sequences of bytes, not defined as strings. Most programs parse them as UTF-8, but nothing mandates that, which obviously leads to problems.
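
To make the bytes-versus-strings mismatch concrete, here is a minimal Python sketch. It assumes a made-up filename (`raw_name`) and touches no real filesystem; it only shows that strict UTF-8 decoding can fail on legal filename bytes, and that the `surrogateescape` error handler (PEP 383) is one way to carry such bytes through a `str` without losing them.

```python
# Hypothetical filename: "café.txt" in Latin-1, which is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xE9 -> U+DCE9), so the original bytes can be recovered exactly.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)                  # False
print(ascii(as_text))             # 'caf\udce9.txt'
print(round_tripped == raw_name)  # True
```

The escaped string is fine for passing back to the OS, but it is not printable text: encoding it with a strict error handler raises an error.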

2 comments

ninkendo 1638 days ago

One pedantic qualification: any byte except 0x2f (`/`) or 0x00.

This actually rules out nearly any non-UTF-8 character set (besides ASCII).

Quote from Linus, which reminds me of Henry Ford’s “you can have any color you want, so long as it’s black”:

> And that one true format is UTF-8. End of story. If you try to talk to the kernel in UCS-2 or anything else, you _will_ fail.

https://lore.kernel.org/all/Pine.LNX.4.58.0402141827200.1402...

jcranmer 1637 days ago

> This actually rules out nearly any non-UTF8 character set (besides ASCII.)

It doesn't--pretty much any character set that has seen widespread use in the past few decades would be compatible. Any single-byte charsets that are ASCII compatible (such as most Windows CP* sets or the entire ISO-8859-* suite) would work. Most Asiatic charsets (e.g., EUC-JP, Shift-JIS, Big5, GBK) that use variable-width encodings follow the rule that single bytes in the 0x00-0x7f range are ASCII and the subsequent bytes of a multibyte character fall in the 0x40-0xff range, so 0x2f and 0x00 never show up inside a multibyte character and these charsets are compatible as well.

So actually the list of notable incompatible charsets is easier to write out: UTF-16, UTF-32, EBCDIC, and ISO-2022-* charsets (which are mode-switching).
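
A rough way to spot-check this in Python is to encode a few sample names and look for the two forbidden bytes. This is only a sketch: the helper name `has_forbidden_byte` and the sample strings are made up for illustration, and a handful of samples is a sanity check, not a proof about the whole charset.

```python
# NUL and '/', the two bytes a Unix filename component cannot contain.
FORBIDDEN = {0x00, 0x2F}


def has_forbidden_byte(text: str, codec: str) -> bool:
    """Return True if encoding `text` with `codec` yields a 0x00 or 0x2f byte."""
    return any(b in FORBIDDEN for b in text.encode(codec))


# None of the sample names contains a '/' or NUL character itself.
samples = [
    ("日本語", "utf-8"),
    ("日本語", "shift_jis"),
    ("日本語", "euc_jp"),
    ("中文", "big5"),
    ("中文", "gbk"),
    ("naïve", "latin-1"),
    ("naive", "utf-16-le"),  # every ASCII character gains a 0x00 byte
    ("naive", "utf-32-le"),  # likewise, three 0x00 bytes per character
]

for text, codec in samples:
    print(f"{codec:10} {has_forbidden_byte(text, codec)}")
```

The ASCII-compatible encodings come out clean for these samples, while UTF-16 and UTF-32 trip on plain ASCII letters because of the zero padding bytes.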

ninkendo 1637 days ago

Eh, fair enough. While you’re correct, character sets that are “ascii, but something custom when the high bit is 1” are all just “ascii” to me, in that they are all mutually incompatible for anything other than the first 128 characters, and 8-bit encoding in general has been ubiquitous for nearly as long as ascii has been defined. (Meaning that when most people say “ascii”, they’re actually referring to one of those encodings in practice.)
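
The "same below 128, mutually incompatible above" point can be sketched in Python by decoding the same bytes under several 8-bit charsets. The list of codecs here is an arbitrary illustrative pick, not an exhaustive survey.

```python
# A few ASCII-compatible 8-bit charsets (arbitrary selection for illustration).
codecs_8bit = ["latin-1", "cp1252", "koi8_r", "iso8859_5", "cp1251"]

# Bytes 0x00-0x7f decode identically everywhere.
ascii_range = bytes(range(128))
low_half_same = all(
    ascii_range.decode(c) == ascii_range.decode("ascii") for c in codecs_8bit
)

# A single high byte means something different depending on the charset.
high_byte = b"\xe9"
decoded = {c: high_byte.decode(c) for c in codecs_8bit}

print(low_half_same)
for codec, ch in decoded.items():
    print(f"{codec:10} U+{ord(ch):04X} {ch}")
```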

Asiatic character sets are an interesting point though. I wonder how common they were at the time of what Linus wrote…

jcranmer 1637 days ago

> While you’re correct, character sets that are “ascii, but something custom when the high bit is 1” are all just “ascii” to me

Don't call them just "ASCII"--that only serves to confuse people. Call them 8-bit ASCII-compatible charsets if you need a collective noun, but note that they are very different.

> (Meaning that when most people say “ascii”, they’re actually referring to one of those encodings in practice.)

Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else. If a document is labeled as ASCII, then generally it should be handled as Windows-1252. If a conversion function claims to convert ASCII to something else, and doesn't provide any error mechanism (which it really should), then it usually means ISO-8859-1 aka Latin-1 aka map each byte to the first 256 Unicode characters.
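
For readers who have not hit the difference between those two, here is a small Python sketch. The sample bytes are made up; the point is only that Latin-1 maps byte N to code point N, while Windows-1252 puts printable characters in 0x80-0x9f.

```python
all_bytes = bytes(range(256))

# Latin-1 is total: byte N always decodes to the code point with value N.
latin1_is_identity = all_bytes.decode("latin-1") == "".join(
    chr(n) for n in range(256)
)

# Windows-1252 agrees with Latin-1 except in 0x80-0x9f, where it places
# printable characters (curly quotes, the euro sign, ...) instead of
# C1 control codes.
smart_quotes = b"\x93quoted\x94"
as_cp1252 = smart_quotes.decode("cp1252")
as_latin1 = smart_quotes.decode("latin-1")

print(latin1_is_identity)  # True
print(as_cp1252)           # “quoted”
print(ascii(as_latin1))    # '\x93quoted\x94'
```

Note that Python's `cp1252` codec raises on the few bytes Windows-1252 leaves unassigned (such as 0x81), so it is not a drop-in "never fails" decoder the way `latin-1` is.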

But I'd never see, e.g., a KOI8-R document referred to as ASCII, nor anything that claimed to be ASCII assumed to be a KOI8-R document.

> Asiatic character sets are an interesting point though. I wonder how common they were at the time of what Linus wrote…

https://4.bp.blogspot.com/-O4jXmTm7WWI/Tyw1As8jt7I/AAAAAAAAI...

At the time he wrote that, the main Asiatic charsets for Chinese and Japanese would have been more common than UTF-8. Maybe Korean as well, although Linus's message is around the time that UTF-8 overtook EUC-KR. In any case, anyone who knew anything about character sets at the time would have been well aware of Asiatic variable-width character sets.

ninkendo 1637 days ago

I appreciate your insight, but I just want to expand on one point:

> Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else.

Approximately zero people are referring to a true, packed, 7-bit encoding when they say "ASCII". They're nearly always talking about an 8-bit character set, and in such cases, something must happen when the high bit is 1. (I've never seen one that plain ignores or uses error glyphs for characters >127, although you likely have more experience with this than I do.) This is why I said people are referring to one of these encodings in practice... because ascii is 7-bit, and approximately everyone is talking about some 8-bit encoding of one form or another.

I would definitely agree that most wouldn't call KOI8-R "ascii", but they may use the term "ascii" to describe the first 128 characters of KOI8-R. (Notwithstanding if it uses weird replacement characters like Shift_JIS does with the backslash and the yen sign.) This is the reason for my comment about how the weird "ascii + custom" all just feels like ascii to me... if you stay below 128 it literally is.

I'll modify my original statement thusly:

> This actually rules out nearly any character set that isn't compatible with ASCII.

And add an addendum that if you don't use UTF-8, you can't use unicode and will be stuck in code page/locale hell.

int_19h 1637 days ago

> I've never seen one that plain ignores or uses error glyphs for characters >127

Reporting an error is the default behavior if you try to decode such a string with the ASCII codec in Python and .NET, at the very least.

The first 128 characters of KOI8-R are, of course, ASCII (the "weird replacement characters" are, in fact, explicitly allowed!). But a file encoded in KOI8-R is only ASCII if it contains only those first 128 chars.
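
Both points are easy to check in Python. In this sketch the helper `decode_ascii_or_none` is a made-up convenience wrapper, not a standard function; it just turns the codec's `UnicodeDecodeError` into `None`.

```python
def decode_ascii_or_none(data: bytes):
    """Decode as strict ASCII; return None instead of raising on bytes >= 0x80."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


# KOI8-R text that stays within the first 128 characters is byte-for-byte ASCII.
latin_only = "hello".encode("koi8_r")
# Cyrillic letters live in the upper half, so this is not ASCII.
cyrillic = "привет".encode("koi8_r")

print(latin_only == b"hello")                                   # True
print(latin_only.isascii(), decode_ascii_or_none(latin_only))   # True hello
print(cyrillic.isascii(), decode_ascii_or_none(cyrillic))       # False None
```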

> if you don't use UTF-8, you can't use unicode and will be stuck in code page/locale hell.

UTF-7 was a thing. It just turned out that nobody really needed it.

CRConrad 1634 days ago

> Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else.

Most American people, maybe.

dylan604 1637 days ago

I see your pedantic and raise you: UTF-8 isn't a font though. It's a text encoding.

marklgr 1637 days ago

String bets not allowed, whatever their encoding ;)

amptorn 1637 days ago

> Unix filenames are just sequences of bytes, not defined as strings

"Write programs to handle text streams, because that is a universal interface except for filenames which are opaque binary"
