Hacker News
by pdonis 3520 days ago
> Applications have for the most part proven that they cannot be trusted to get text encoding and decoding right, especially not in any consistent way.

That's because text encoding and decoding is a mess. Operating systems doing it doesn't make it any less of a mess; it just inserts the mess deeper into everything. For example, look at all the quirks and edge cases in file name handling between different OS's, simply because nobody is willing to just admit that to the OS, file names should be sequences of bytes, which are easy to share between machines running different OS's.
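As a rough illustration of the "file names are just bytes" view, here is a minimal Python sketch. It assumes a platform where the file system hands names back unchanged, and the file name and the helper `list_names_as_bytes` are made up for the example. Passing a bytes path to `os.listdir` returns the entries as bytes, so nothing in the program ever has to guess an encoding.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return directory entries as raw bytes, without decoding them."""
    # A bytes path makes os.listdir return bytes names instead of str.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the program only ever handles its bytes.
    raw_name = "café.txt".encode("utf-8")
    with open(os.path.join(os.fsencode(tmp), raw_name), "wb"):
        pass  # create an empty file
    names = list_names_as_bytes(tmp)
```

Whether those bytes mean "café.txt" is then a question for whatever displays them, not for the code that stores or copies the file.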

The basic issue is that text encoding and decoding exists because bytes have meanings. But unless/until we invent artificial intelligence, computers can't deal with meanings (because the meanings are not simple computable functions of the bytes). And OS's, particularly, should not even try. Applications might have to try, but the cost if they get it wrong is much less.


Regardless of whether operating systems get involved in tasks like re-encoding text, they really should at least carry along the metadata about encodings whenever they're handling bytes that represent strings. Completely ignoring the problem and leaving it up to applications further up the stack just ensures that there will be incompatible competing standards for how to tell applications how to decode the string data they get from the OS. You don't want some apps trying to write filenames in UTF-8 while others use UTF-16, but allowing it to happen silently is even worse.
> Completely ignoring the problem and leaving it up to applications further up the stack just ensures that there will be incompatible competing standards for how to tell applications how to decode the string data they get from the OS.

I think it's naive to think operating systems aren't going to fragment in order to offer "features" (and lock-in), and then papering over all that fragmentation has to happen in the application anyway.

Unless there's a standard, that is. And if there's a standard, the application itself can deal with it.

> Regardless of whether operating systems get involved in tasks like re-encoding text, they really should at least carry along the metadata about encodings whenever they're handling bytes that represent strings.

I have no problem with this as long as the metadata itself is just additional bytes. But if the metadata needs to be decoded in order to figure out how to decode it, we have a problem... :-)

That's untenable. A higher-level API for strings with encodings needs to get the OS involved in the semantics to at least some extent, or else it merely obfuscates the problem instead of solving it. If the OS provides a way to store strings with a metadata field representing the string encoding, but doesn't define which bit pattern means UTF-8, then all of that extra complexity at best serves to call attention to the fact that encoding matters, but it does nothing to help applications ensure that they correctly interpret data created by a different application. If you're going to give your platform official APIs to address the very real problem of handling string encodings, then they ought to be useful enough to truly make it less of a problem. And since none of this actually precludes also including low-level byte-oriented APIs, there's no justification for stopping with a super-minimalist half-solution.
> If the OS provides a way to store strings with a metadata field representing the string encoding

You're missing my point. The OS should provide a way to store bytes. That's it. The meaning of the bytes is up to the application. If, to the application, the bytes represent text with a certain encoding, then it's up to the application to figure out how to translate the bytes, possibly using other stored bytes to decide. The OS doesn't need to get involved in any of this.
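One existing example of "other stored bytes" deciding the interpretation is the byte order mark. The sketch below is only an illustration, not a full detector: `decode_with_bom` is a hypothetical helper, it ignores UTF-32 (whose little-endian BOM starts with the same two bytes as the UTF-16 one), and the UTF-8 fallback is an assumption.

```python
import codecs

# Byte order marks this sketch recognizes, paired with the codec to use.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, default="utf-8"):
    """Decode bytes, using a leading BOM (if any) to choose the codec."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the marker bytes, then decode the rest.
            return data[len(bom):].decode(encoding)
    # No marker: fall back to an assumed default encoding.
    return data.decode(default)
```

The OS never looks at the marker here; it stores and returns the bytes, and the application applies the convention.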

> it does nothing to help applications ensure that they correctly interpret data created by a different application

This is already a solved problem, and it isn't solved by OS's. It's solved by standards. For example, every web browser constantly has to correctly interpret data created by a different application. It can do so because HTML, CSS, JS, etc. are all standards that define how the bytes sent from the server to the client are to be interpreted. The browser doesn't even have to care what OS it's running on; all the OS is doing is giving it network sockets and a place for local data storage.
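To make the browser case concrete: HTTP lets the server state the encoding in the `Content-Type` header, and the client decodes accordingly. The following is a simplified sketch, not what browsers actually implement (their sniffing rules are considerably more involved). `decode_body` and the UTF-8 fallback are assumptions for the example; the header parsing uses Python's standard `email.message` module.

```python
from email.message import Message


def decode_body(content_type, body, fallback="utf-8"):
    """Decode a response body using the charset named in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter, or None if absent.
    charset = msg.get_content_charset() or fallback
    return body.decode(charset)
```

For instance, `decode_body("text/html; charset=ISO-8859-1", b"caf\xe9")` yields `"café"`, and the same code behaves identically on any OS that can deliver the bytes.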

> If you're going to give your platform official APIs to address the very real problem of handling string encodings

If "platform" means "OS", then no, I'm not. If "platform" means "application framework", then sure, but an application framework is not the same thing as an OS. The fact that many OS's insist on also being application frameworks does not make the two things the same.

> If "platform" means "OS", then no, I'm not. If "platform" means "application framework", then sure, but an application framework is not the same thing as an OS. The fact that many OS's insist on also being application frameworks does not make the two things the same.

As I said originally, we tried that, and it doesn't work. Even within the context of a single locale, .NET applications will happily emit UTF-16 to be consumed by a Python script expecting all strings to be UTF-8, and with only byte-oriented APIs there's no side channel to convey that there's a mismatch that needs to be reconciled. Extending this problem from file and pipe contents to filenames is moving in the wrong direction. Operating systems absolutely should get involved in helping applications safely and usefully exchange information; that doesn't destroy the concept of an application framework, it just means that your OS is more than a hypervisor.
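The mismatch described above is easy to reproduce with plain byte strings. This is a minimal sketch with made-up sample text, standing in for a UTF-16 producer and a UTF-8 consumer. Note that the ASCII-only case does not even raise an error; it just yields text full of NUL characters.

```python
# Bytes as a UTF-16 (little-endian) producer might emit them.
utf16_bytes = "naïve".encode("utf-16-le")

# ASCII-only text decodes "successfully" as UTF-8, but with embedded NULs.
garbled = "ok".encode("utf-16-le").decode("utf-8")

# Non-ASCII text is not valid UTF-8 in this form, so decoding fails.
try:
    utf16_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Nothing in the bytes themselves tells the consumer which of these situations it is in.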

Very late post but I can't resist paraphrasing the old joke about regular expressions: some people, whenever they see an application-level problem, think "Oh, I'll just get the OS to solve it!" Now they have two problems.
That's an integration problem and the solution is to pipe it through something that knows enough to do the conversion.
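A sketch of what such a conversion step could look like, assuming both encodings are known up front. `transcode_stream` is a hypothetical helper; it uses the incremental decoders and encoders from Python's `codecs` module so that multi-byte sequences split across chunk boundaries are handled.

```python
import codecs
import io


def transcode_stream(src, dst, source_encoding, target_encoding="utf-8",
                     chunk_size=4096):
    """Copy bytes from src to dst, converting between two encodings."""
    decoder = codecs.getincrementaldecoder(source_encoding)()
    encoder = codecs.getincrementalencoder(target_encoding)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers incomplete sequences until the next chunk.
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush both ends; a truncated input raises UnicodeDecodeError here.
    dst.write(encoder.encode(decoder.decode(b"", final=True), final=True))


# Usage: a tiny chunk size forces splits in the middle of code units.
source = io.BytesIO("naïve 😀".encode("utf-16-le"))
sink = io.BytesIO()
transcode_stream(source, sink, "utf-16-le", chunk_size=3)
result = sink.getvalue()
```

In a real pipeline `src` and `dst` would be something like `sys.stdin.buffer` and `sys.stdout.buffer`. The filter still has to be told the source encoding by someone, which is the part the two sides of this thread disagree about.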