by jbeda 2508 days ago
This is a great rundown. I started my career at Microsoft working with and on Win32/COM and saw this play out first hand.

One thing not mentioned here is the history of the "Byte Order Mark" (BOM) in Unicode.

(Not an expert here, but this is my understanding from having lived it.)

You see, given UCS-2, there are 2 ways to encode any codepoint -- either big endian or little endian. The idea was then to create a codepoint (U+FEFF) that you could put at the start of a text stream that would signify what order the file was encoded in.
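A minimal sketch of that byte-order check, assuming Python and its standard `codecs` constants. The function name `sniff_utf16_byte_order` is made up for illustration, and it only looks at 2-byte code units (it does not try to tell UTF-16 from UTF-32).

```python
import codecs

def sniff_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading U+FEFF."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"  # no BOM: the reader has to guess or be told

# The same codepoint, U+FEFF, serialized in each byte order.
print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-be")))  # big
print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-le")))  # little
print(sniff_utf16_byte_order(b"hi"))                           # unknown
```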

Wikipedia page: https://en.m.wikipedia.org/wiki/Byte_order_mark

This then got overloaded. When loading a legacy text format, there is often the difficulty of figuring out which code page to use. When applied to HTML, there are a bunch of ways to do it and they don't always work. There are things like charset meta tags (but you have to parse enough HTML to find the tag and then restart the decode/parse). But often even that was wrong. Browsers used to (and still do?) have an "autodetect" mode where they would try to divine the code page based on content. This is all in the name of "be liberal in what you accept".

Enter UTF-8. How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content? How does this apply to regular old text files? Well, the answer is to use the BOM. Encode it in UTF-8 and put it at the start of the text file.

But often times people want to treat simple UTF-8 as ASCII and you end up with a high value codepoint in what would otherwise be an ASCII document. And everyone curses it.
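To make that concrete, here is a small sketch (Python assumed; the sample bytes are invented) of what the UTF-8 encoded BOM looks like and why an ASCII-only reader trips over it.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

doc = bom + b"hello"  # an otherwise pure-ASCII document

# A strict ASCII reader rejects the very first byte.
try:
    doc.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A plain UTF-8 reader keeps U+FEFF as an invisible leading character,
# while Python's "utf-8-sig" codec drops it.
with_bom = doc.decode("utf-8")
without_bom = doc.decode("utf-8-sig")
print(ascii_ok, len(with_bom), len(without_bom))  # False 6 5
```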

Having the BOM littered everywhere doesn't seem to be as much of a problem now as it used to be. I think a lot of programs stopped putting it in and a lot of other programs talk UTF-8 and deal with it silently. Still something to be aware of though.

4 comments

Yeah, the BOM has gotten me a few times.

My most spectacular fail was a program that read UTF-8 or Latin-1 and wrote UTF-16, preserving but not displaying null characters. I believe this was the default behavior in HyperStudio. Every round-trip would double the size of the file by inserting null bytes every other character. Soon there were giant stretches of null characters between each displayed character, but the displayed text never appeared to change even though the disk requirements doubled with each launch. That's how I learned about UTF-16!
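A rough simulation of that doubling, in Python. The mechanism is my assumption, not a description of the real program: read every byte as a Latin-1 character, write every character as a 2-byte UTF-16 unit, and hide nulls on display. `round_trip` is a made-up name.

```python
def round_trip(data: bytes) -> bytes:
    """One load/save cycle: read bytes as Latin-1, write them back as UTF-16."""
    text = data.decode("latin-1")    # every byte, including 0x00, becomes a character
    return text.encode("utf-16-le")  # every character becomes two bytes

data = b"hi"
for launch in range(1, 4):
    data = round_trip(data)
    # What the user sees: nulls are preserved on disk but not displayed.
    visible = data.decode("latin-1").replace("\x00", "")
    print(launch, len(data), visible)
# 1 4 hi
# 2 8 hi
# 3 16 hi
```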

Speaking of Win32/COM... is there a "tcpdump for COM"? I've got a legacy app that uses COM for IPC and I've been instrumenting each call for lack of one.

If there is anything like tcpdump for COM, it would be part of Event Tracing for Windows, but you’d probably prefer to use it via Microsoft Message Analyzer.
A tcpdump for COM would be incredible... but I'm in the same boat of having just instrumented each call individually :(

I would guess though, that there is probably some pretty helpful code for this in the apitrace program that could probably be lifted out and reused, since DirectX APIs tend to involve a lot of COM. I haven't tried, though.

ï»¿ is the representation of the UTF-8 BOM byte sequence in Latin-1. If this comment were stored as Latin-1 and you assumed it was UTF-8 just because it began with that byte sequence, you would discard an important part of my message.
Indeed, Windows Notepad does exactly that (ignores ï»¿ and reads the rest as UTF-8)
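The ambiguity is easy to reproduce. A sketch in Python, with an invented sample string, where `utf-8-sig` stands in for any reader that treats a leading EF BB BF as a BOM:

```python
import codecs

# Three ordinary Latin-1 characters (ï, », ¿) followed by ASCII text.
original = "\u00ef\u00bb\u00bf is three Latin-1 characters"
stored = original.encode("latin-1")

# On disk, the file is indistinguishable from UTF-8 text with a BOM.
assert stored.startswith(codecs.BOM_UTF8)

# A reader that trusts the "BOM" silently drops the first three characters.
misread = stored.decode("utf-8-sig")
print(repr(misread))  # ' is three Latin-1 characters'
```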
I have sympathy for its authors. There is no way to really know what the right encoding is. Your options are to guess based on heuristics, allow the user to specify, or demand a particular format. Even the friendliest applications just guess and allow the user to override.
If you start using the BOM like that for UTF-8, then it's not really a byte order marker anymore.

BTW, OLE/COM is something that didn't click at all for me when I first started encountering it in the '90s. I'm kind of bummed that it seems to have been left behind because it's still a useful technology.

> How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content?

If the document contains no codepoints above 0x7F, it is both US-ASCII and UTF-8 at the same time. If the document decodes as valid UTF-8, it’s more likely to be UTF-8 than whatever national encoding it might be (latin1? windows-1252? equivalents for other countries?)
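That heuristic can be sketched in a few lines of Python. `guess_encoding` and its `fallback` default are hypothetical; the fallback is just whatever legacy encoding you would otherwise assume.

```python
def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Guess in order: ASCII, then UTF-8, then a caller-chosen legacy encoding."""
    try:
        data.decode("ascii")
        return "ascii"   # no bytes above 0x7F: also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"   # legacy-encoded text rarely happens to be valid UTF-8
    except UnicodeDecodeError:
        return fallback  # a guess, not a detection

print(guess_encoding(b"plain text"))            # ascii
print(guess_encoding("café".encode("utf-8")))   # utf-8
print(guess_encoding("café".encode("latin-1"))) # windows-1252
```

It is still only a guess: short legacy-encoded strings can happen to be valid UTF-8, which is why a user override is worth keeping.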

With the BOM, you get more serious problems. Everyone on the way needs to know about it, and that it needs to be ignored when reading, but probably not when writing. I remember the olden days of modifying .php files from CMSes with notepad.exe, which happily, silently added a BOM to the file, and now suddenly your website displays three weird characters at the very top of the page.
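For the .php case, the usual workaround is to strip the BOM before the file is served or re-saved. A byte-level sketch in Python; `strip_utf8_bom` is a made-up helper and the PHP snippet is just sample data.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a single leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# What an editor that prepends a BOM would have saved.
saved = codecs.BOM_UTF8 + b"<?php echo 'hi'; ?>"
print(strip_utf8_bom(saved))  # b"<?php echo 'hi'; ?>"
```

Only the first three bytes are touched, so files without a BOM pass through unchanged.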