| This is a great rundown. I started my career at Microsoft working with and on Win32/COM and saw this play out firsthand. One thing not mentioned here is the history of the "Byte Order Mark" (BOM) in Unicode. (Not an expert here, but this is my understanding from having lived it.) You see, given UCS-2, there are two ways to serialize any codepoint: big endian or little endian. The idea was then to create a codepoint (U+FEFF) that you could put at the start of a text stream to signify which byte order the file was encoded in. Wikipedia page: https://en.m.wikipedia.org/wiki/Byte_order_mark This then got overloaded. When loading a legacy text format, there is often the difficulty of figuring out which code page to use. When applied to HTML, there are a bunch of ways to do it and they don't always work. There are things like charset meta tags (but you have to parse enough HTML to find one and then restart the decode/parse). But oftentimes even that was wrong. Browsers used to (and still do?) have an "autodetect" mode where they would try to divine the code page based on content. This is all in the name of "be liberal in what you accept". Enter UTF-8. How can you tell if a doc is US-ASCII or UTF-8 if there are no other indications in the content? How does this apply to regular old text files? Well, the answer is to use the BOM. Encode it in UTF-8 (the bytes EF BB BF) and put it at the start of the text file. But oftentimes people want to treat simple UTF-8 as ASCII, and you end up with a high-value codepoint in what would otherwise be an ASCII document. And everyone curses it. Having the BOM littered everywhere doesn't seem to be as much of a problem now as it used to be. I think a lot of programs stopped putting it in, and a lot of other programs talk UTF-8 and deal with it silently. Still something to be aware of though. |
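For anyone who hasn't had to deal with this, here is a rough Python sketch of the BOM sniffing described above. The helper names (`sniff_bom`, `decode_text`) and the UTF-8 fallback are my own assumptions for illustration, not how any particular program does it; the `codecs.BOM_*` constants are from the standard library.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE). Caveat: UTF-16 LE text whose first character
# after the BOM is U+0000 would be misread as UTF-32 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload without BOM). `default` is an assumed fallback."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return default, data


def decode_text(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes using the BOM if one is present, else the fallback."""
    encoding, payload = sniff_bom(data, default)
    return payload.decode(encoding)


# A reader that ignores the BOM keeps U+FEFF as the first character,
# which is the stray "high-value codepoint" people curse.
naive = b"\xef\xbb\xbfhi".decode("utf-8")
```

In Python specifically, the built-in `utf-8-sig` codec strips a leading UTF-8 BOM for you, which is one example of a program that "deals with it silently".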
My most spectacular fail was a program that read UTF-8 or Latin-1 and wrote UTF-16, preserving but not displaying null characters. I believe this was the default behavior in HyperStudio. Every round-trip would double the size of the file by inserting a null byte after every character. Soon there were giant stretches of null characters between each displayed character, but the displayed text never appeared to change even though the disk requirements doubled with each launch. That's how I learned about UTF-16!
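A minimal Python sketch of how that doubling can happen. This is my guess at the mechanism, not the program's actual code: I'm assuming it reads every byte as one Latin-1 character and writes little-endian UTF-16 without a BOM. The function names are made up for the example.

```python
def round_trip(data: bytes) -> bytes:
    """One open-and-save cycle under the assumed behavior."""
    text = data.decode("latin-1")    # every byte, including 0x00, becomes one character
    return text.encode("utf-16-le")  # every character becomes two bytes


def displayed(data: bytes) -> str:
    """What the user sees: nulls are preserved in the file but never shown."""
    return data.decode("latin-1").replace("\x00", "")


data = "hi".encode("latin-1")  # 2 bytes on disk to start with
sizes = []
for _ in range(4):
    data = round_trip(data)
    sizes.append(len(data))

print(sizes)            # file size after each launch
print(displayed(data))  # visible text is unchanged
```

Each pass turns the nulls from the previous pass into characters of their own, so the padding between visible characters grows while the screen shows the same text.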
Speaking of Win32/COM... is there a "tcpdump for COM"? I've got a legacy app that uses COM for IPC and I've been instrumenting each call for lack of one.