by WalterBright
1777 days ago
One technique I try first is to write code that cannot fail. For example, a sort function should never fail. Consider the case of running out of memory. One option is to pre-allocate all the memory the algorithm will need, so that it can't run out of memory. Another option is to regard out-of-memory as a fatal error, not one that needs to be thrown and caught.
Another example is UTF-8 processing. Early on, I did the obvious when invalid UTF-8 sequences were discovered: throw an exception. But this got in the way of high speed string processing (exceptions, even in the happy path, are slow). But what does one do anyway with such input? Abort the display of the text? Nope. The bad sequence gets replaced with the Unicode "replacement character". This turns out to be common practice, and now my UTF-8 processing code cannot fail! And it's smaller and faster, too.
It's a fun challenge to figure out how to organize the program so it can't fail.
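To make the UTF-8 point concrete, here is a minimal Python sketch. It is not the code the comment describes (that code isn't shown); it only contrasts a decoder that can raise with one that cannot. The function names `decode_utf8_strict` and `decode_utf8_lossy` are hypothetical, and the sketch leans on the standard `bytes.decode` with `errors="replace"`.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD


def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8; raises UnicodeDecodeError on malformed input."""
    return data.decode("utf-8")


def decode_utf8_lossy(data: bytes) -> str:
    """Decode UTF-8 without a failure path.

    Malformed sequences are replaced with U+FFFD, so callers have
    no decoding error to catch or propagate.
    """
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 0xFF can never appear in well-formed UTF-8.
    print(decode_utf8_lossy(b"abc\xffdef"))
```

The lossy version has no error case in its contract, which is the property the comment is after: callers never need a `try` around it.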
This is in practice almost invariably the case for large programs. Somebody (Herb Sutter maybe?) asked the major C++ Standard library implementers, and none of them really bothers to handle the tricky parts of this. If you write code to try to pre-allocate a 10 TB vector of 'Z's you can probably get that to throw you the exception that you read about in the documentation, but if the library code for opening a file can't find 64 bytes for a temporary object, it isn't going to bubble up an exception; it's going to crash your program, and too bad.
If you write an operating system kernel, you care about running out of memory. If you write the embedded firmware for a jet engine, you care too (actually you likely never allocate memory at runtime, so in that sense you don't care). But in both those cases you live in a world where many other problems are far above you, out of sight, so you can afford to care about stuff like how much RAM there actually is. You don't want the C++ standard library down where you live, and they don't want your problems. Everybody who lives up above the C++ standard library doesn't care, which is why the people implementing the library don't care either.
Yes, all of Unicode processing should use U+FFFD (the replacement character). Not just UTF-8: if you have any reason to do anything Unicode related and you're in a state where other paths forward are nonsense, emit U+FFFD. Take XML. Because the people involved hated ASCII control codes, XML says you can't express them (other than tab, line feed and carriage return) in XML 1.0, which you will in practice have to use. I don't mean they need to be escaped, I mean you intentionally cannot express them. So if you have some arbitrary ASCII text that might include control codes, you can't write that as valid XML. What to do? Emit U+FFFD whenever this problem arises. Your users go "Huh, my Vertical Tab turned into this weird character in the XML output" and you send them to talk to the XML committee, which will tell them they're a sinner and must repent of the evil of Vertical Tab. Now your user knows you aren't crazy, and maybe they stop using XML or maybe they don't, but either way your code works.
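A rough Python sketch of that XML approach, under stated assumptions: the helper names `xml10_safe` and `to_xml_text` are made up for illustration, and the allowed ranges are transcribed from the `Char` production in the XML 1.0 specification (tab, line feed, carriage return, then U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). Anything else becomes U+FFFD before the usual escaping.

```python
import re
from xml.sax.saxutils import escape

# Matches any character NOT allowed by the XML 1.0 Char production.
# This includes Vertical Tab (U+000B), NUL, and the other C0 controls
# apart from tab, line feed and carriage return.
_XML10_DISALLOWED = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml10_safe(text: str) -> str:
    """Replace characters XML 1.0 cannot express with U+FFFD."""
    return _XML10_DISALLOWED.sub("\ufffd", text)


def to_xml_text(text: str) -> str:
    """Produce element content: substitute first, then escape &, <, >."""
    return escape(xml10_safe(text))


if __name__ == "__main__":
    # The Vertical Tab is swapped for U+FFFD; '<' is escaped as usual.
    print(to_xml_text("a<b\x0bc"))
```

Neither helper has an error path for any input string, so the caller never has to decide what to do about an "unrepresentable character" exception.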