by jerf 1330 days ago
"A type system isn't going to save you from users submitting all kinds of potentially different encodings."

Yes, it is, because you give that a type that indicates you don't know what the encoding is, like RawInput or something. You then cannot pass this type to any function that doesn't explicitly call for that type. If you have some function that accepts it, blindly casts it to UTF-8, and slams it out into a file, well, that's not the type system's fault [1].
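To make the RawInput idea concrete, here is a minimal sketch in Rust. Only the name RawInput comes from the comment; `NotUtf8`, `decode_utf8`, and `shout` are hypothetical names made up for illustration, and this is one possible shape rather than a definitive design.

```rust
/// Bytes whose encoding has not been established yet.
pub struct RawInput(Vec<u8>);

/// Error returned when the bytes turn out not to be valid UTF-8 (hypothetical type).
#[derive(Debug)]
pub struct NotUtf8;

impl RawInput {
    pub fn new(bytes: Vec<u8>) -> Self {
        RawInput(bytes)
    }

    /// The only way to get text out of a `RawInput`; the caller has to handle failure.
    pub fn decode_utf8(self) -> Result<String, NotUtf8> {
        String::from_utf8(self.0).map_err(|_| NotUtf8)
    }
}

/// A function that only accepts text that has already been validated.
pub fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let raw = RawInput::new(vec![0x68, 0x69]); // the bytes of "hi"
    // shout(&raw); // does not compile: expected `&str`, found `&RawInput`
    match raw.decode_utf8() {
        Ok(text) => println!("{}", shout(&text)),
        Err(NotUtf8) => eprintln!("input was not UTF-8"),
    }
}
```

The inner field is private, so code outside the defining module has to go through `decode_utf8` and look at the `Result` before it gets anything it can treat as a string.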

Of course a type system won't prevent you from still just being wrong or writing bugs; nobody promises that, not even the formal methods advocates. But it will prevent you from just accidentally blindly shoveling it out somewhere it doesn't belong without ever examining it or thinking about it.

I think you may be believing in a popular myth about strong type systems, that they are designed to somehow prevent bad data from coming into your system at all. You correctly identify that as impossible. But what strong type systems can do is force you to deal with the fact that bad data may be coming in. On the outside, you have the chaos of, say, a bag of bytes that may or may not be JSON. On the inside, you have a "type SomeStruct { int a; int b }". A strong type system forces you to write some sort of adapting code between those two, and guarantees that the result of that adapting code will be only and exactly the type that comes out of that adapting code, no "whoops, sometimes this dynamic code just returns a string, or maybe a network socket, or who knows what". Nothing can prevent your HTTP API from receiving a JPG of an anime character instead of JSON specifying a user to delete, but a strong type system can make you deal with that immediately and fully, instead of garbage data of indeterminate type floating through the system for an indeterminate period of time.
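A rough sketch of that adapting code in Rust, using only the standard library. To keep it dependency-free, a made-up wire format of two comma-separated integers (for example "3,4") stands in for JSON; that format, `ParseError`, and `parse_some_struct` are assumptions for illustration, and only SomeStruct and its two int fields come from the comment.

```rust
/// The clean internal representation.
#[derive(Debug, PartialEq)]
pub struct SomeStruct {
    pub a: i32,
    pub b: i32,
}

/// Every way the adapting code can reject its input (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ParseError {
    NotUtf8,
    WrongShape,
    BadNumber,
}

/// Adapter between the outside world (arbitrary bytes) and `SomeStruct`.
/// Assumed wire format for this sketch: two integers separated by a comma.
pub fn parse_some_struct(bytes: &[u8]) -> Result<SomeStruct, ParseError> {
    let text = std::str::from_utf8(bytes).map_err(|_| ParseError::NotUtf8)?;
    let (left, right) = text.trim().split_once(',').ok_or(ParseError::WrongShape)?;
    let a = left.trim().parse::<i32>().map_err(|_| ParseError::BadNumber)?;
    let b = right.trim().parse::<i32>().map_err(|_| ParseError::BadNumber)?;
    Ok(SomeStruct { a, b })
}

fn main() {
    println!("{:?}", parse_some_struct(b"3,4"));
    // The leading bytes of a JPEG file are rejected right at the boundary.
    println!("{:?}", parse_some_struct(&[0xFF, 0xD8, 0xFF, 0xE0]));
}
```

Whatever arrives, the caller ends up holding either a `SomeStruct` or a `ParseError`, never a half-parsed value of some other type.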

[1]: Also note there are a lot of "strong type systems" in the world that still fail to take advantage of their own capabilities and let bare string types and such float around too much. There are reasons why libraries must support the lowest common denominator; a file is a series of bytes with no further constraints, so the lowest level API has no choice but to accept that, but higher level APIs should more often take more restricted types. That strong type systems can save you from this doesn't mean they all do. I have a number of wrapper types in various languages just to add guarantees to my programs that the underlying libraries don't provide, though I also have some code that just wraps the underlying libraries that can't help but correctly take raw bytes at the lowest level.
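One way such a wrapper type might look in Rust, as a sketch only: `Utf8Sink` and `write_text` are hypothetical names, and the wrapper sits on top of the standard `std::io::Write` trait, which correctly takes raw bytes at the lowest level.

```rust
use std::io::{self, Write};

/// Wraps any byte sink so that callers can only write validated text to it.
pub struct Utf8Sink<W: Write> {
    inner: W,
}

impl<W: Write> Utf8Sink<W> {
    pub fn new(inner: W) -> Self {
        Utf8Sink { inner }
    }

    /// `&str` is guaranteed to be valid UTF-8, so raw bytes cannot slip through here.
    pub fn write_text(&mut self, text: &str) -> io::Result<()> {
        self.inner.write_all(text.as_bytes())
    }

    /// Hands back the underlying byte sink.
    pub fn into_inner(self) -> W {
        self.inner
    }
}

fn main() -> io::Result<()> {
    // A Vec<u8> stands in for a file in this sketch.
    let mut sink = Utf8Sink::new(Vec::<u8>::new());
    sink.write_text("hello")?;
    // sink.write_text(&[0xFF, 0xFE]); // does not compile: expected `&str`
    println!("{} bytes written", sink.into_inner().len());
    Ok(())
}
```

The byte-level API is still there underneath; the wrapper just makes the restricted path the only one the rest of the program can reach.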


>If you have some function that accepts it, blindly casts it to UTF-8

Unfortunately, if you interact with services you didn't write, you're usually back to getting "strings" of unknown encoding, and typically requirements that force some blind or semi-blind guessing.

Blind guessing is not related to the type system. Nobody has claimed type systems can solve that. What they can do is force you to guess, and make it clear where that is occurring.
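As a sketch of what "force you to guess, and make it clear where that is occurring" could look like in Rust: the guess happens in one function and stays visible in the return type. `Decoded` and `decode_with_guess` are hypothetical names, and falling back to ISO-8859-1 (Latin-1) is an assumption picked only because it is easy to show with the standard library.

```rust
/// How the text was obtained; a guess stays visible in the type.
#[derive(Debug, PartialEq)]
pub enum Decoded {
    /// The bytes were valid UTF-8.
    Utf8(String),
    /// The bytes were not valid UTF-8, so we guessed ISO-8859-1 (Latin-1).
    GuessedLatin1(String),
}

/// The one place in the program where encoding guesses happen.
pub fn decode_with_guess(bytes: &[u8]) -> Decoded {
    match std::str::from_utf8(bytes) {
        Ok(text) => Decoded::Utf8(text.to_owned()),
        // Latin-1 maps each byte directly to the code point with the same value.
        Err(_) => Decoded::GuessedLatin1(bytes.iter().map(|&b| b as char).collect()),
    }
}

fn main() {
    // 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
    match decode_with_guess(&[b'c', b'a', b'f', 0xE9]) {
        Decoded::Utf8(text) => println!("utf-8: {}", text),
        Decoded::GuessedLatin1(text) => println!("guessed latin-1: {}", text),
    }
}
```

The guess can still be wrong, but every caller has to match on `Decoded`, so nothing downstream can mistake guessed text for known-good text.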

This, again, goes back to a very broken understanding of type systems that I often see, and once held myself. The claim of type systems is not that they magically go out into the world and fix the external world to be well-typed; the claim is that they force your code to deal with the conversion of the external world into a clean internal representation, and presumably, to have a clean error pathway when that fails. Dynamically-typed code will let you float along much more easily. Statically-typed code can still be written that way, but at least then it's poor statically-typed code. In some circles that sort of broken dynamic code is essentially idiomatic. (Though that is fading away as every year more programmers learn how bad an idea that is.)

I agree with that if you qualify it with "sometimes". Strong types can force you to guess, sometimes. Other times, the data fits the type but isn't the type.
If the language explicitly says how strings are defined, libraries that go "Eh, I'll just shove nonsense bytes in this data structure and claim that's a string" are broken by definition.

That's just as true in Java as in Rust. The problem is languages like C++ or D which just don't care and have a "string" type that might just be some bytes.

I don't mean libraries, I mean external services. Ambiguous strings are everywhere.
The libraries in question are the ones consuming the output of those external services. If an external service sends data that does not map to the programming language's string type, then the string type will fail to be created from that invalid input, and the library was wrong to have tried.
The external services are not always explicit about, or compliant with, the stated content of said string. Follow this comment chain up, and you'll see mention of blindly casting to UTF-8. The point I was making is that you don't always know what the encoding is.
If the service doesn't send you back a string, it doesn't send you back a string. End of story. You're continuously trying to complicate the issue by insisting that they're sending strings and that the client needs to guess their encoding. They are not and it does not.