by tialaramex 1325 days ago
> Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is.

This is a language defect. If your language was invented in the 1960s it's an understandable defect, but it's still a defect. I do not want to write computer software with strings in a language that doesn't even have an actual string type rather than "Eh, maybe this is a string or maybe it's just some random bytes, who cares".

Only in very low level software should it make a difference whether the string is in fact represented as UTF-8 or UTF-16 or whatever, but Rust shows that you can write software at a low level and still enforce type safety for strings.

I agree though that here once again the Right Thing™ is a strong type system. If I've got a Microsoft Graph username, a URL, an email address and a UUID, that's four types, those are not four strings with human names to distinguish them. We don't need to escape some or any of these types - in their context.
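
To make the "four types, not four strings" point concrete, here is a minimal Rust sketch. The names `GraphUsername`, `EmailAddress`, and `mailto_link` are hypothetical, and the email check is deliberately naive (an assumption for illustration, not real address validation).

```rust
// Hypothetical newtypes: each wraps a String, but the compiler treats them
// as unrelated types, so they cannot be swapped by accident.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GraphUsername(String);

impl GraphUsername {
    pub fn new(name: &str) -> Self {
        GraphUsername(name.to_owned())
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EmailAddress(String);

impl EmailAddress {
    /// Naive check (assumption): exactly one '@' with text on both sides.
    pub fn parse(s: &str) -> Option<Self> {
        let (local, domain) = s.split_once('@')?;
        if local.is_empty() || domain.is_empty() || domain.contains('@') {
            return None;
        }
        Some(EmailAddress(s.to_owned()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Accepts only an EmailAddress. Passing a GraphUsername or a bare &str
/// is rejected at compile time.
pub fn mailto_link(addr: &EmailAddress) -> String {
    format!("mailto:{}", addr.as_str())
}
```

With these definitions, `mailto_link(&GraphUsername::new("alice"))` does not compile (mismatched types), which is the whole point: the distinction lives in the type, not in the variable name.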

2 comments

A type system isn't going to save you from users submitting all kinds of potentially different encodings. It also depends on what kind of user input is being handled: Is it OS-provided UI? Is it something being sent to a service accessible on the internet? Is it from a CLI? Is it from a file? Context determines what kind of data you might be operating on. Where you control the input, you can know what you have; where you don't, as when reading from an arbitrary file, you have to detect it or be told (correctly). All of that is external to the type system, and has to happen before you can tag the data with the correct type. Some languages might attempt to detect this for you, but that could itself be considered a language defect, since it's hard to tell what a string contains without other input telling you, such as a header in an HTTP request saying that it's UTF-8.

"A type system isn't going to save you from users submitting all kinds of potentially different encodings."

Yes, it is, because you give that a type that indicates you don't know what the encoding is, like RawInput or something. You then cannot pass this type to any other function that doesn't explicitly call for that type. If you have some function that accepts it, blindly casts it to UTF-8, and slams it out into a file, well, that's not the type system's fault [1].
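
A rough sketch of that idea in Rust, assuming a hypothetical `RawInput` wrapper (the name comes from the comment above; `decode_utf8` and `greeting` are made up for illustration). The only way to get a `String` out is to go through validation, and the failure case hands the untouched bytes back.

```rust
/// Bytes of unknown encoding. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct RawInput(Vec<u8>);

impl RawInput {
    pub fn new(bytes: Vec<u8>) -> Self {
        RawInput(bytes)
    }

    /// Validate as UTF-8. On failure the original bytes come back unchanged,
    /// still wrapped in RawInput, so the caller has to decide what to do next.
    pub fn decode_utf8(self) -> Result<String, RawInput> {
        String::from_utf8(self.0).map_err(|e| RawInput(e.into_bytes()))
    }
}

/// Takes validated text only; a RawInput cannot be passed here.
pub fn greeting(name: &str) -> String {
    format!("Hello, {}!", name)
}
```

Calling `greeting` with a `RawInput` is a compile error, so the decision about what the bytes are has to be made somewhere visible in the code.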

Of course a type system won't prevent you from still just being wrong or writing bugs; nobody promises that, not even the formal methods advocates. But it will prevent you from just accidentally blindly shoveling it out somewhere it doesn't belong without ever examining it or thinking about it.

I think you may be believing in a popular myth about strong typing systems, that they are designed to somehow prevent bad data from coming into your system at all. You correctly identify that as impossible. But what strong typing systems can do is force you to deal with the fact that bad data may be coming in. On the outside, you have the chaos of, say, a bag of bytes that may or may not be JSON. On the inside, you have a "type SomeStruct { int a; int b }". A strong type system forces you to write some sort of adapting code between those two, and guarantees that the result of that adapting code will be only and exactly the type that comes out of that adapting code, no "whoops, sometimes this dynamic code just returns a string, or maybe a network socket, or who knows what". Nothing can prevent your HTTP API from receiving a JPG of an anime character instead of JSON specifying a user to delete, but a strong type system can make you deal with that immediately and fully, instead of garbage data of indeterminate type floating through the system for an indeterminate period of time.
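
Here is a sketch of what such adapting code could look like in Rust. To keep it dependency-free it parses a toy `a=1,b=2` wire format rather than JSON; that format, the `ParseError` enum, and `from_bytes` are all assumptions made for the example.

```rust
use std::str;

#[derive(Debug, PartialEq)]
pub struct SomeStruct {
    pub a: i32,
    pub b: i32,
}

/// Every way the outside world can disappoint us, spelled out.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    NotUtf8,
    BadShape,
    BadNumber,
}

impl SomeStruct {
    /// Adapter from "bag of bytes" to a typed value. The result is either
    /// exactly a SomeStruct or exactly a ParseError, never anything else.
    pub fn from_bytes(bytes: &[u8]) -> Result<SomeStruct, ParseError> {
        let text = str::from_utf8(bytes).map_err(|_| ParseError::NotUtf8)?;
        let (left, right) = text
            .trim()
            .split_once(',')
            .ok_or(ParseError::BadShape)?;
        let a = left.strip_prefix("a=").ok_or(ParseError::BadShape)?;
        let b = right.strip_prefix("b=").ok_or(ParseError::BadShape)?;
        Ok(SomeStruct {
            a: a.parse().map_err(|_| ParseError::BadNumber)?,
            b: b.parse().map_err(|_| ParseError::BadNumber)?,
        })
    }
}
```

Feeding it the first bytes of a JPEG (`0xFF 0xD8 0xFF`) yields `Err(ParseError::NotUtf8)` at the boundary, rather than a mystery value drifting further into the program.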

[1]: Also note there are a lot of "strong type systems" in the world that still fail to take advantage of their own capabilities and let bare string types and such float around too much. There are reasons why libraries must support the lowest common denominator; a file is a series of bytes with no further constraints, so the lowest level API has no choice but to accept that, but higher level APIs should more often take more restricted types. That strong type systems can save you from this doesn't mean they all do. I have a number of wrapper types in various languages just to add guarantees to my programs that the underlying libraries don't provide, though I also have some code that just wraps the underlying libraries that can't help but correctly take raw bytes at the lowest level.

> If you have some function that accepts it, blindly casts it to UTF-8

Unfortunately, if you interact with services you didn't write, you're usually back to getting "strings" of unknown encoding, and typically requirements that force some blind or semi-blind guessing.

Blind guessing is not related to the type system. Nobody has claimed type systems can solve that. What they can do is force you to guess, and make it clear where that is occurring.

This, again, goes back to a very broken understanding of type systems that I often see, and once held myself. The claim of type systems is not that they magically go out into the world and fix the external world to be well-typed; the claim is that they force your code to deal with the conversion of the external world into a clean internal representation, and presumably, to have a clean error pathway when that fails. Dynamically-typed code will let you float along much more easily. Statically-typed code can still be written that way, but at least then it's poor statically-typed code. In some circles that sort of broken dynamic code is essentially idiomatic. (Though that is fading away as every year more programmers learn how bad an idea that is.)

I agree with that if you qualify it with "sometimes". Strong types can force you to guess, sometimes. Other times, the data fits the type but isn't the type.

If the language explicitly says how strings are defined, libraries that go "Eh, I'll just shove nonsense bytes in this data structure and claim that's a string" are broken by definition.

That's just as true in Java as in Rust. The problem is languages like C++ or D which just don't care and have a "string" type that might just be some bytes.

I don't mean libraries, I mean external services. Ambiguous strings are everywhere.

The libraries in question are the ones consuming the output of those external services. If an external service sends data that does not map to the programming language's string type, then the string type will fail to be created from that invalid input, and the library was wrong to have tried.

The way Rust does it is IMO interesting. There is, for example, an OsStr type for strings such as the filenames in a directory listing, because these could actually be invalid UTF-8 but your program might still need to be able to handle them.

So when you wanna convert that OsStr to a String you are forced to handle this in one way or another. This is less comfortable, but describes the underlying systems more accurately.
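
For illustration, a small sketch of the two usual ways to handle that conversion using the standard library's `OsStr::to_str` and `OsStr::to_string_lossy`. The helper names `name_strict` and `name_lossy` are made up for this example.

```rust
use std::ffi::OsStr;

/// Strict: returns None unless the name is valid Unicode.
pub fn name_strict(name: &OsStr) -> Option<String> {
    name.to_str().map(str::to_owned)
}

/// Lossy: always succeeds, but invalid sequences are replaced
/// with U+FFFD, so the result may no longer name the same file.
pub fn name_lossy(name: &OsStr) -> String {
    name.to_string_lossy().into_owned()
}
```

Which one is appropriate depends on the program: a tool that only displays filenames can probably live with the lossy version, while anything that has to open the file again should keep the original `OsStr` (or `OsString`) around.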