by StefanKarpinski 1282 days ago

It's not at all obvious how it helps, but it does.

First, why is Python unable to represent invalid path names as strings? Because internally it converts strings from UTF-8, UTF-16, or any other encoding, to a fixed-width array of decoded Unicode code points. The width of integer used to represent code points is determined by the largest code point in the string: if the string is ASCII, it can use a byte (uint8) per character; if the string is non-ASCII but all BMP, then it can use a uint16 per character; otherwise it has to use uint32 per character.
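A rough way to see the width selection described above, assuming CPython (the layout is an implementation detail, and `sys.getsizeof` includes a fixed object header, so only the relative sizes are meaningful). The variable names are made up for illustration:

```python
import sys

# Three 1000-character strings that differ only in their widest code point.
ascii_text = "a" * 1000             # ASCII only
bmp_text = "\u20ac" * 1000          # euro sign, inside the BMP
astral_text = "\U0001F600" * 1000   # emoji, outside the BMP

sizes = {
    "ascii": sys.getsizeof(ascii_text),
    "bmp": sys.getsizeof(bmp_text),
    "astral": sys.getsizeof(astral_text),
}

# On CPython the three sizes grow roughly as 1x, 2x and 4x the length.
for name, size in sizes.items():
    print(name, size)
```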

Why does Python do all this? So that you can have O(1) character indexing. If you gave up on that, you wouldn't need to convert the string at all, you could just leave it as (potentially invalid) UTF-8 data.

Suppose you get an invalid path on UNIX, where paths are UTF-8 by convention. What does Python do with this string? It can't convert it to an array of code points because invalid UTF-8 doesn't correspond to a code point (well, it can if it's just illegal, not malformed, but in general, we have to consider completely malformed strings that don't even follow the basic UTF-8 format). So Python is stuck: it can only replace the invalid data with something like the Unicode replacement character. But then you can't do anything useful with that because it's not the correct name of the path you're trying to work with.
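A small sketch of the lossy-replacement problem, using a hypothetical file name (`raw_name` is invented for the example). It only shows strict decoding and `errors="replace"`; it is not a description of what Python's own path APIs do:

```python
# Hypothetical file name containing 0xFF, a byte that never appears in valid UTF-8.
raw_name = b"report-\xff.txt"

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement decoding succeeds but substitutes U+FFFD for the bad byte,
# so encoding it again no longer yields the original name.
lossy = raw_name.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")

print(strict_ok, lossy, round_trip == raw_name)
```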

How does using UTF-8 to represent strings help? Because you can represent invalid strings: just leave them as-is and don't try to decode them unless you have to. Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do. If someone asks for decoding, _then_ you can give an error. What about Windows where paths are UTF-16 by convention? You can convert them to WTF-8 and everything works out. (Described in way more detail here: https://news.ycombinator.com/item?id=33984308).
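To illustrate the Windows case, here is a sketch that approximates the WTF-8 idea with Python's `surrogatepass` error handler. This is an approximation chosen for the example, not a full WTF-8 implementation, and the input is a made-up name holding an unpaired surrogate:

```python
# UTF-16LE code units for a hypothetical Windows file name: a lone high
# surrogate (0xD800) followed by "a". This is not valid UTF-16.
utf16_units = b"\x00\xd8" + "a".encode("utf-16-le")

# surrogatepass lets the lone surrogate through instead of raising.
text = utf16_units.decode("utf-16-le", errors="surrogatepass")

# Encode the surrogate with the ordinary 3-byte UTF-8 pattern, which is
# how WTF-8 represents unpaired surrogates.
wtf8_like = text.encode("utf-8", errors="surrogatepass")

# The conversion is reversible, so the original name is not lost.
restored = (
    wtf8_like.decode("utf-8", errors="surrogatepass")
    .encode("utf-16-le", errors="surrogatepass")
)

print(wtf8_like, restored == utf16_units)
```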

3 comments

masklinn 1282 days ago

> How does using UTF-8 to represent strings help? Because you can represent invalid strings: just leave them as-is and don't try to decode them unless you have to.

That’s not UTF8. That’s a bag’o bytes which might be UTF8. Very different thing.

> Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do.

It’s not, any unicode-aware text processing does it implicitly. This means any such processing has to either perform its own validation that the input is valid, or it may fly off the rails entirely if fed nonsense. This also increases the risk of security issues: either outright UB, or the ability to smuggle payloads through overlong encodings.

StefanKarpinski 1282 days ago

> That’s not UTF8.

True; I was careful not to call it that, but treating strings as UTF-8 by convention does make sense.

> It’s not, any unicode-aware text processing does it implicitly. This means any such processing has to either perform its own validation that the input is valid, or it may fly off the rails entirely if fed nonsense.

In theory, but that's just not how most string operations actually work. If you have two UTF-8 strings and you want to concatenate them, you just concatenate the bytes. It would be ridiculously inefficient to decode the code points in each string and then re-encode them back into a destination buffer. If you have two UTF-8 strings and you want to see if one is a substring of the other and at what byte index, you just look for the bytes of one as a "substring" of the bytes of the other. Again, it would be ridiculously inefficient to decode the code points in each and do matching on code points. But what if the strings aren't valid UTF-8?! Both of those operations work just fine even if the strings aren't valid and produce sensible, intuitive results.
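A minimal sketch of the two byte-level operations described above, using Python `bytes` as a stand-in for a "UTF-8 by convention" string type (the sample values are invented):

```python
valid = "café/".encode("utf-8")   # well-formed UTF-8, 6 bytes
invalid = b"\xff\xfe-data.bin"    # not valid UTF-8

# Concatenation just copies bytes; nothing is decoded or re-encoded.
joined = valid + invalid

# Substring search is also a plain byte search and reports a byte offset.
offset = joined.find(invalid)

print(joined, offset)
```

The valid prefix is untouched by the operation: `joined[:offset]` still decodes cleanly, even though `joined` as a whole does not.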

If you're implementing a browser or a terminal that has to actually display UTF-8 as characters then sure, you have to actually decode characters. Similarly, if you're parsing text somehow, then you have to decode characters. But many programs only do concatenation and search and other operations like that which are actually implemented in terms of byte sequences, not characters.

masklinn 1282 days ago

> True; I was careful not to call it that

You specifically called it UTF8, repeatedly. The very comment I quoted asserts that "Utf8 would help deal with the issue [of garbage inputs]" (in its denial of the opposite assertion). You also did it in https://news.ycombinator.com/item?id=33986421

> If you have two UTF-8 strings and you want to concatenate them, you just concatenate the bytes.

That's not a unicode-aware operation, it's mostly a unicode-irrelevant operation (though unicode awareness can be useful in edge cases because of special grapheme clusters, but that's very task-specific).

> But what if the strings aren't valid UTF-8?! Both of those operations work just fine even if the strings aren't valid and produce sensible, intuitive results.

If your content is not actually UTF-8, you can end up with UTF-8, thus changing the semantics of the content. You can also end up with overlong UTF-8, which also changes the semantics of the content in a worse way.
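A sketch of both cases, with a hypothetical `is_valid_utf8` helper built on Python's strict decoder. The inputs are contrived: each half is invalid on its own, yet the concatenation is either valid UTF-8 or an overlong form:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Case 1: a truncated sequence plus a stray continuation byte.
left = b"caf\xc3"    # ends with a lead byte that has no continuation
right = b"\xa9.txt"  # starts with a continuation byte that has no lead
combined = left + right

# Case 2: the two halves of an overlong 2-byte encoding of "/".
overlong_slash = b"\xc0" + b"\xaf"

print(is_valid_utf8(left), is_valid_utf8(right), combined.decode("utf-8"))
print(is_valid_utf8(overlong_slash))
```

Python's strict decoder rejects the overlong form; the concern only bites with a lenient decoder that would read it as "/".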

StefanKarpinski 1282 days ago

The comment that you're quoting wasn't mine. The comment you link to says "UTF-8 by convention". If either string is valid, then the result is as expected. If you're concatenating two strings that are both invalid UTF-8, there's not much you can do that's better than just concatenating the bytes together... which is exactly what treating them as byte arrays would end up doing (but it's less convenient). If you're worried about invalid UTF-8 you can check for validity (which again, is exactly what you end up doing if you use byte arrays).

masklinn 1282 days ago

> The comment that you're quoting wasn't mine.

The comment I originally quoted was yours. The second quote is adapted from a statement you denied. It is thus your statement. Let me put both sections together since you seem unwilling to do so:

>> Utf8 would not help with the issue in the article in any way.

> It's not at all obvious how it helps, but it does.

So you are stating, unambiguously, that "Utf8 would help deal with the issue [of garbage inputs]".

> The comment you link to says "UTF-8 by convention".

The comment I link says:

> Treating paths as UTF-8 works very well

Which is either

1. wrong

2. nonsensical, given the later statement that you should not "require your UTF-8 strings to be valid", which would make them not UTF-8

> If you're concatenating two strings that are both invalid UTF-8, there's not much you can do that's better than just concatenating the bytes together... which is exactly what treating them as byte arrays would end up doing

But that's the point innit? You're asserting semantics which don't hold and which you break with no regard.

> (but it's less convenient).

Is it now? Here's the concatenation of two strings:

    a + b

here's the concatenation of two byte arrays:

    a + b

You're right, the inconvenience makes me shudder. What horror. What indignity.

adgjlsfhk1 1282 days ago

The problem with using strict UTF-8 for paths is that paths aren't guaranteed to be valid UTF-8. How do you want to write a program that opens a path whose name is invalid UTF-8?

masklinn 1282 days ago

> The problem with using strict UTF-8 for paths is that paths aren't guaranteed to be valid UTF-8.

Ok but I’m not saying to do that. I’m saying if you have not-utf8 strings don’t call them UTF8.

> How do you want to write a program that opens a path whose name is invalid UTF-8?

That’s not my problem given I’m not advocating for that.

StefanKarpinski 1282 days ago

The issue is that when you're implementing something like a programming language or a robust general purpose utility, then simply not being able to open—or list or remove or stat—paths with invalid names is not really acceptable.

masklinn 1282 days ago

Do you actually read comments before replying?

ilyt 1282 days ago

> Why does Python do all this? So that you can have O(1) character indexing. If you gave up on that, you wouldn't need to convert the string at all, you could just leave it as (potentially invalid) UTF-8 data.

Seems like "giving up" would've been the better choice, considering just how rare that operation is. Or, alternatively, doing the conversion lazily the first time an operation needing runes instead of bytes happens.

Most string operations don't access the string by index, and most of them would be fast enough even at O(n) because n is small. Like in the typical "get a file name, extract some info from it" case, you're doing the extraction once and anything after that doesn't need character indexing, because you already got the relevant data.
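For a sense of what O(n) character indexing over UTF-8 looks like, here is a sketch with a hypothetical `nth_code_point` function. It assumes the input is valid UTF-8 and simply scans for lead bytes (any byte that is not of the form `10xxxxxx`):

```python
def nth_code_point(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 data by linear scan."""
    seen = 0       # number of code point starts seen so far
    start = None   # byte offset where the wanted code point begins
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte: a new code point starts here
            if start is not None:      # the wanted code point ended just before this byte
                return data[start:i].decode("utf-8")
            if seen == n:
                start = i
            seen += 1
    if start is not None:              # the wanted code point was the last one
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")

name = "naïve 日本".encode("utf-8")
print(nth_code_point(name, 2), nth_code_point(name, 7))
```

Every lookup walks the bytes from the start, which is the cost being traded for not converting the string up front.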

zokier 1282 days ago

> How does using UTF-8 to represent strings help? Because you can represent invalid strings: just leave them as-is and don't try to decode them unless you have to. Sure, you can't decode them as code points, but that's actually a pretty unusual thing to do. If someone asks for decoding, _then_ you can give an error

How is that better than just handling paths as `bytes`?
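For context on the `bytes` option, a small sketch: Python's `os.path` functions accept `bytes` paths and return `bytes`, and the `surrogateescape` error handler gives a reversible `str` view of the same data. The directory and file names below are invented and nothing touches the file system:

```python
import os

raw_name = b"bad-\xff.txt"   # hypothetical name that is not valid UTF-8

# Path manipulation works directly on bytes, with no decoding involved.
joined = os.path.join(b"some-dir", raw_name)
base = os.path.basename(joined)

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xFF becomes U+DCFF), so the str form can be turned back into the same bytes.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
back = as_text.encode("utf-8", errors="surrogateescape")

print(base == raw_name, back == raw_name)
```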
