Hacker News
by lstamour 2138 days ago
I probably don't need to say this, but it all depends. Many operating systems and GUI frameworks internally use UTF-16 because it was more common when they were built. Lots of old files use really obscure encodings. Sometimes you get a UTF BOM to identify UTF-16 and UTF-32, other times you don't. Then there are the pesky ways you can encode characters with HTML or XML entities, the occasional double-encoding of such, and so on.
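
To make the BOM point concrete, here is a minimal sketch of sniffing one from the first few bytes of a file. The helper name `sniff_bom` is made up for illustration, and it only helps in the "sometimes you get a BOM" case; with no BOM it returns nothing and you are back to guessing.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    encoding, skip = sniff_bom(raw)
    print(encoding, raw[skip:].decode(encoding))
```

One caveat: a UTF-16 LE file whose first character is U+0000 would be misread as UTF-32 LE by this check, which is rare in practice but shows why a BOM is a hint rather than proof.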

When I worked with library records, I had to deal with text encodings that predated SQL. I suppose I should be thankful that ASCII existed by then, so they were mostly ASCII-compatible. Even today, though, there are MARC 21 systems designed to output MARC-8 and fall back to UTF-8 only when a character isn't available in MARC-8, instead of just using UTF-8.

I'll admit though, outside of MARC-8 and the various Unicode encodings, I'm having trouble thinking up systems that would still be incompatible today. Old documents, yes, absolutely would be encoded in different charsets, and Windows still generally defaults to its own Latin-1 variant (Windows-1252 on Western locales) if I recall correctly. But most systems today do expect UTF-8 over the network at least, and UTF-16 for display perhaps...
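
A rough sketch of what that mismatch looks like in practice, assuming the Windows "Latin1" in question is Windows-1252 (Python's `cp1252` codec). The helper `decode_best_effort` is a hypothetical name, and "try UTF-8, then fall back" is just one common heuristic, not a guaranteed way to recover the original text.

```python
def decode_best_effort(data: bytes) -> str:
    """Try UTF-8 first; fall back to Windows-1252 if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a handful of byte values undefined, so replace
        # those instead of raising a second time.
        return data.decode("cp1252", errors="replace")


if __name__ == "__main__":
    # The same word, written out by two systems with different defaults.
    print(decode_best_effort("café".encode("utf-8")))
    print(decode_best_effort("café".encode("cp1252")))

    # The classic mojibake: UTF-8 bytes read as Windows-1252.
    print("é".encode("utf-8").decode("cp1252"))
```

The fallback order matters: almost any byte sequence decodes as Windows-1252, so trying it first would never fail and would silently garble valid UTF-8.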

Don't get me started on line endings though, and how many files use one style, the other, or a mix of several ... and especially how much fun it can be with git repos cross-platform, or when automated tools use platform-default line endings when they should be configurable, etc. CSV files that aren't properly escaped are also a special mini hell...
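
For the mixed line endings case, a small sketch of how you might check what a file actually contains before a tool "helpfully" rewrites it. `count_line_endings` is a made-up helper, and it works on raw bytes so that Python's own newline translation doesn't hide the problem.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF also contains one LF and one CR, so subtract those.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }


if __name__ == "__main__":
    sample = b"windows\r\nunix\nold mac\rwindows again\r\n"
    counts = count_line_endings(sample)
    print(counts)
    styles_in_use = [name for name, n in counts.items() if n]
    if len(styles_in_use) > 1:
        print("mixed line endings:", ", ".join(styles_in_use))
```

To run it against a real file, read it with `open(path, "rb")` so the bytes arrive untouched.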

Data is never easy. :) And that's assuming it's written correctly - https://rachelbythebay.com/w/2020/08/11/files/

1 comment

Wanted to amend this list of character encodings with another one I came across recently -- GSM-7, used for SMS. https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_al... If a message you send includes any character outside that alphabet, including emoji, it gets sent as UCS-2 instead (the fixed-width predecessor of UTF-16) and costs more, since fewer characters fit in each message segment.
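
A quick sketch of checking which way a message would go. The character tables below are typed out by hand from the GSM 03.38 default alphabet and its basic extension table, so treat them as an assumption to verify against the spec; `sms_encoding` is a hypothetical helper, and real gateways may apply national language tables or transliteration that this ignores.

```python
# GSM 03.38 default alphabet (basic table), one septet per character.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

# Extension table: still GSM-7, but each of these takes two septets
# (an escape code followed by the character code).
GSM7_EXTENSION = "^{}\\[~]|€\f"


def sms_encoding(text: str) -> str:
    """Return "GSM-7" if every character fits the tables above, else "UCS-2"."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENSION for ch in text):
        return "GSM-7"
    return "UCS-2"


if __name__ == "__main__":
    for message in ["Hello, world!", "price: 5€ [ok]", "Hi 😀"]:
        print(repr(message), "->", sms_encoding(message))
```

A single out-of-alphabet character is enough to switch the whole message to UCS-2, which is why one emoji or curly quote can change how many segments a message needs.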