| Generally you can safely treat text in an unknown encoding as UTF-8. Since you're expecting potential failures but want to press on anyway rather than raise an exception or error, you treat invalid sequences as U+FFFD, the Replacement Character, as you would in a language or API with no exception-reporting mechanism. There are lots of pleasing aspects to this choice. It's ASCII-compatible of course, so anything that was actually ASCII is still ASCII, and anything that was almost ASCII is just ASCII with U+FFFD where it deviated. The replacement character resolutely isn't any of the specific things, nor any of the generic classes of thing, you might be expected to treat differently for security reasons. It isn't a number, or a letter (of either "case"), it isn't white space, and it certainly isn't any of the separators, escapes or quote markers like ? or \ or + or . or _ and so on, yet it is still inside the BMP, so it won't trigger weird (perhaps less well tested) behaviour for other planes. UTF-8 is also self-synchronising. If something goes wrong somehow, then within a few bytes, provided what follows is UTF-8 or an ASCII-compatible encoding, the decoder will synchronise properly; you never end up "out of phase" as can happen with some encodings. Most usefully, whatever you're now butted up against works with UTF-8 now. Maybe some day that'll get formally documented, maybe it won't. As the years drag on, the chance of anyone specifying _anything else_ shrinks further, and the de facto popularity of UTF-8 means that even if it's never formalised anywhere, everybody will just assume UTF-8 anyway and you won't have to lift a finger. |
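For what it's worth, here is a minimal Python sketch of the behaviour described in the quote above. The helper name `decode_lenient` and the sample byte strings are made up for illustration; the only real API used is `bytes.decode` with `errors="replace"` plus the standard `unicodedata` module.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_lenient(raw: bytes) -> str:
    """Decode bytes as UTF-8, turning invalid sequences into U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


# Pure ASCII comes through unchanged.
ascii_text = decode_lenient(b"plain ASCII")

# "Almost ASCII": a stray Latin-1 byte (0xE9) is not valid UTF-8 here,
# so it is replaced and the surrounding ASCII is left alone.
almost_ascii = decode_lenient(b"caf\xe9 ok")

# Self-synchronisation: drop the first byte of a two-byte character so the
# input starts mid-sequence. Only the damaged character is lost; the decoder
# is back in step for the characters that follow.
truncated = "\u00e9\u65e5\u672c".encode("utf-8")[1:]
resynced = decode_lenient(truncated)

# U+FFFD is classified as a symbol ("So"): not a letter, digit or white space,
# and it sits inside the BMP (code point below 0x10000).
category = unicodedata.category(REPLACEMENT)
in_bmp = ord(REPLACEMENT) <= 0xFFFF
```

Whether silently replacing bytes is acceptable depends on what you do with the text afterwards; this only shows that the decoder keeps going rather than raising.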
When I worked with library records, I had to deal with text encodings that pre-dated SQL, though I suppose I should be thankful that ASCII existed by then, so they were mostly ASCII-compatible. Even today there are systems designed to output MARC-8, with UTF-8 as a fallback only when a MARC-8 character isn't available (MARC-21), instead of just using UTF-8.
I'll admit, though, that outside of MARC-8 and the various Unicode encodings, I'm having trouble thinking of systems that would still be incompatible today. Old documents, yes, absolutely would be encoded in different charsets, and Windows still generally defaults to its own Latin-1 variant (Windows-1252) if I recall correctly, but most systems today do expect UTF-8 over the network at least, and UTF-16 for display perhaps...
Don't get me started on line endings though, and how many files use one style, the other, or a mix of several ... and especially how much fun it can be with cross-platform git repos, or when automated tools use the platform default line endings when they should be configurable, etc. CSV files that aren't properly escaped are also a special mini hell...
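As a rough illustration of both complaints, here is a small Python sketch using only the standard library. The function names `normalise_newlines` and `to_csv` and the sample data are invented for the example, and it assumes LF is the line ending you want to end up with.

```python
import csv
import io


def normalise_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so it is not turned into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_csv(rows) -> str:
    """Serialise rows with the csv module so commas, quotes and newlines get escaped."""
    buffer = io.StringIO()
    # csv.writer emits "\r\n" by default; the terminator is set explicitly here
    # so the choice is visible and configurable.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# A file that mixes CRLF, lone CR and LF endings.
mixed = "one\r\ntwo\rthree\nfour"
clean = normalise_newlines(mixed)

# A field containing a comma, double quotes and an embedded newline.
tricky_rows = [["id", "note"], ["1", 'says "hi", then\nleaves']]
encoded = to_csv(tricky_rows)

# Reading it back with the same module recovers the original fields.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

The point of going through `csv.writer` rather than joining on commas by hand is that the awkward field gets quoted and its inner quotes doubled, so a conforming reader can round-trip it.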
Data is never easy. :) And that's assuming it's written correctly - https://rachelbythebay.com/w/2020/08/11/files/