
by tialaramex 1388 days ago

While researching this comment I read some of the D library documentation and found what I think is probably a docbug at this URL:

https://dlang.org/phobos/std_utf.html#.byUTF

"Throws: UTFException if invalid UTF sequence and useReplacementDchar is set to UseReplacementDchar.yes"

My guess is that this is a mistake and should instead say UseReplacementDchar.no since it makes sense to throw an exception if you can't use U+FFFD here, rather than do both.

Anyway, in my view this is bad in the same way the Billion Dollar Mistake is bad, and Rust made the right choice here. Arrays of stuff are great, but they aren't strings. Having to sprinkle "or maybe not" cases all over these libraries, because of course these might not really be strings, results in exception fatigue for your developers, which in turn results in lower-quality software and more effort for the conscientious developers who stick it out.

D's strings are less stupid than C's (and thus some of the C++ strings) but they're still just arrays which are maybe but maybe not actually text.

3 comments

WalterBright 1388 days ago

Thanks for the bug report. I filed it for you: https://issues.dlang.org/show_bug.cgi?id=23405

Having string be a magic builtin type does not eliminate the problem of dealing with invalid UTF sequences.

Invalid UTF sequences are inherent to the Unicode design, and programmers are left on their own to deal with it. The options are:

1. ignore them

2. use the replacement char

3. throw an exception (or other error indication)

D enables the programmer to pick which they need, on a case by case basis.
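For readers who want the three options side by side, here is a minimal sketch in Rust rather than D. The function names `decode_ignoring`, `decode_replacing`, and `decode_strict` are made up for this illustration; only the `std::str::from_utf8`, `Utf8Error::valid_up_to`, `Utf8Error::error_len`, and `String::from_utf8_lossy` calls come from the Rust standard library. It is not meant to mirror D's `std.utf` API.

```rust
use std::str;

/// Option 1: ignore (drop) invalid sequences and keep only the valid parts.
fn decode_ignoring(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // This prefix was already validated by `from_utf8`, so unwrap cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                match err.error_len() {
                    // Skip the invalid bytes and keep decoding the remainder.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence: drop the tail.
                    None => return out,
                }
            }
        }
    }
}

/// Option 2: substitute U+FFFD for each invalid sequence.
fn decode_replacing(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Option 3: report an error to the caller.
fn decode_strict(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

fn main() {
    let input = b"ab\xFFcd"; // 0xFF can never appear in valid UTF-8
    println!("{:?}", decode_ignoring(input));
    println!("{:?}", decode_replacing(input));
    println!("{:?}", decode_strict(input).is_err());
}
```

All three policies are a few lines each; the disagreement in this thread is about where in a program the choice gets made, not whether the choice exists.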


tialaramex 1379 days ago

> Thanks for the bug report. I filed it for you: https://issues.dlang.org/show_bug.cgi?id=23405

#23405 was resolved as fixed a week ago. It isn't fixed. I guess at least I didn't waste my time filing the bug.


tialaramex 1387 days ago

The problem does need solving, but it only needs solving once. D's approach means the programmer needs to make this decision over, and over, and over again everywhere they have an alleged "string". Or they must track somehow (by convention perhaps?) whether string A is or is not "really" a string.

If you have type safety, you can make the choice just once.

Rust's String::from_{utf8,utf16}_lossy turn valid UTF-8/16 sequences into strings, and "fix" invalid ones with U+FFFD.

Meanwhile String::from_{utf8,utf16} attempt the same but with an Err instead of replacement on failure if that's what the programmer wants.
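A short sketch of the four standard-library functions named above. The wrapper names (`lossy_utf8`, `strict_utf8`, and so on) are invented here only so the two behaviours are easy to compare. Note that `String::from_utf8_lossy` actually returns a `Cow<str>`, which this sketch converts to an owned `String`.

```rust
use std::string::{FromUtf16Error, FromUtf8Error};

/// Lossy decode: always succeeds, with U+FFFD standing in for invalid input.
fn lossy_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn lossy_utf16(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Strict decode: the caller gets an Err instead of a replacement character.
fn strict_utf8(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

fn strict_utf16(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

fn main() {
    let bad_utf8: &[u8] = b"caf\xC3"; // truncated two-byte sequence
    let bad_utf16: [u16; 3] = [0x0068, 0xD800, 0x0069]; // 'h', lone surrogate, 'i'

    println!("{:?}", lossy_utf8(bad_utf8));
    println!("{:?}", lossy_utf16(&bad_utf16));
    println!("{:?}", strict_utf8(bad_utf8.to_vec()).is_err());
    println!("{:?}", strict_utf16(&bad_utf16).is_err());
}
```

Either way, whatever comes out the other side is a `String`, and code that receives it does not need to know which policy produced it.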

Imagine if all D's numeric functions took the same attitude as its string functions, insisting on being passed arrays of bytes so that each function can parse those bytes, decide if this is actually a 16-bit unsigned integer (for example) and, if so, do what's expected, otherwise perhaps return an error. We'd spot right away that this was not a practical design.
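The thought experiment can be made concrete with a small, entirely hypothetical sketch in Rust. None of these functions exist in any library; `decode_u16`, `add_from_bytes`, and `add` are invented to contrast a byte-oriented numeric API with a typed one, and little-endian encoding is an arbitrary assumption.

```rust
/// Hypothetical decoder: are these bytes really a 16-bit little-endian integer?
fn decode_u16(bytes: &[u8]) -> Option<u16> {
    match bytes {
        [lo, hi] => Some(u16::from_le_bytes([*lo, *hi])),
        _ => None, // wrong length: not a u16 at all
    }
}

/// Byte-oriented style: every arithmetic function re-decodes its arguments
/// and has to carry a "maybe these were not numbers" failure case.
fn add_from_bytes(a: &[u8], b: &[u8]) -> Option<u16> {
    let a = decode_u16(a)?;
    let b = decode_u16(b)?;
    a.checked_add(b)
}

/// Typed style: decoding happened once at the boundary, so the only failure
/// left is the arithmetic one (overflow).
fn add(a: u16, b: u16) -> Option<u16> {
    a.checked_add(b)
}

fn main() {
    println!("{:?}", add_from_bytes(&[1, 0], &[2, 0]));
    println!("{:?}", add_from_bytes(&[1], &[2, 0]));
    println!("{:?}", add(1, 2));
}
```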

D's choices here are conventional, but I've come to expect a lot more and so I'm disappointed when I can't have it.


WalterBright 1387 days ago

I don't see the difference here. D offers the same options when processing a string.


tialaramex 1387 days ago

That's surely the whole point: every D std.string function is also a string decoder with varying features. But a suitably decoded "string" is still just the same type, whereas Rust has a distinct type for actual UTF-8 strings.


wtetzner 1387 days ago

I think the point is that you run the Unicode validation once on your [u8] array, which gives you a &str (or String for the lossy variants). From then on, you know you have valid Unicode and don't need to keep checking.
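A minimal sketch of that "validate once" shape, assuming a program that reads raw bytes at its boundary. The helper names `parse_text`, `count_chars`, and `shout` are hypothetical; the only standard-library call doing the validation is `std::str::from_utf8`.

```rust
use std::str::Utf8Error;

/// Boundary: the one place where bytes may turn out not to be text.
fn parse_text(raw: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(raw)
}

/// Everything past the boundary takes `&str`, so it has no error path for
/// invalid UTF-8 and never re-checks the encoding.
fn count_chars(text: &str) -> usize {
    text.chars().count()
}

fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    let raw = "naïve".as_bytes();
    match parse_text(raw) {
        Ok(text) => {
            println!("{} code points", count_chars(text));
            println!("{}", shout(text));
        }
        Err(err) => eprintln!("not UTF-8: {}", err),
    }
}
```

The signatures carry the guarantee: a function that accepts `&str` cannot be handed unvalidated bytes without an explicit (and `unsafe`) opt-out.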


glandium 1387 days ago

On the other hand, the sad reality is that even when you have a plethora of string types to accommodate reality, as Rust does, people will just not care, out of convenience. See how Rust build scripts communicate paths to cargo via stdout, and how most of them just use Path::display (or something similar or worse) to do that, which is lossy. Rustc itself doesn't handle paths correctly either. IIRC, all in all, it's basically impossible to compile Rust code from a non-UTF-8 path.
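The lossiness of `Path::display` can be shown with a short sketch. It assumes a Unix target, because `std::os::unix::ffi::OsStrExt` is the standard way to build a path from arbitrary bytes there; the helper `describe` and the example path are hypothetical.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: lets us build a non-UTF-8 path
use std::path::Path;

/// Returns the exact UTF-8 view of the path (if there is one) alongside the
/// text that `display()` would print.
fn describe(path: &Path) -> (Option<&str>, String) {
    (path.to_str(), path.display().to_string())
}

fn main() {
    // 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
    let path = Path::new(OsStr::from_bytes(b"/tmp/caf\xE9"));
    let (exact, shown) = describe(path);

    println!("to_str():  {:?}", exact);
    println!("display(): {}", shown);
    // The displayed text no longer names the same file: the original byte
    // has been replaced with U+FFFD and cannot be recovered from it.
}
```

Any consumer that receives only the displayed text, as with a path printed to stdout, has no way to reconstruct the original bytes.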


acehreli 1388 days ago

D's string is not text by itself because it is an array of UTF-8 code units. However, we have this infamous feature called auto-decoding in the standard library that presents strings as Unicode code points.

On the other hand, D's dstrings are more like text because they are not only UTF-32 but also random-accessible code points. (D does not address multiple representations of graphemes at language level. For example, at language level, ğ is different from "g and combining breve" but there are std.uni and std.utf modules that help.)
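The ğ example is not specific to D. As a hedged illustration in Rust (the helper `sizes` is invented here), the precomposed and decomposed spellings are different code point sequences and compare unequal unless something normalizes them first. Rust's standard library does not normalize either; that job is left to external crates.

```rust
/// Returns (UTF-8 byte length, number of code points) for a string.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{11F}"; // ğ as a single code point
    let decomposed = "g\u{306}"; // g followed by U+0306 COMBINING BREVE

    // Both render as the same grapheme, but they are different sequences.
    println!("equal: {}", precomposed == decomposed);
    println!("precomposed (bytes, code points): {:?}", sizes(precomposed));
    println!("decomposed  (bytes, code points): {:?}", sizes(decomposed));
}
```

So even a type that guarantees valid UTF-8, or random access to code points, still stops short of "text" in the sense a user would recognize.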


tialaramex 1388 days ago

> D's string is not text by itself because it is an array of UTF-8 code units.

Bytes. It's an array of bytes. D's char type isn't actually restricted to UTF-8 code units, char x = '\xFF'; works just fine even though that's not UTF-8.


acehreli 1388 days ago

I see what you mean, but an array of bytes is something else in D: byte[].
