| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tialaramex 1579 days ago

They're essentially identical problems. In Linux these strings are arbitrary bytes with a rule forbidding byte 0x00, in Windows they're arbitrary unsigned 16-bit integers, I don't know if 0x0000 is forbidden. Not a significant difference.

Rust's OsString provides a structure that has the potentially nonsensical raw data inside it, and offers ways to ask for:

* The Unicode text if that's what is actually encoded (or else None)

* The text after applying Unicode decoding rules and substituting U+FFFD (Unicode's "Replacement Character" �) where errors occur

* The actual raw bytes / 16-bit unsigned integers

If all your program cared about was a filename, this is not coincidentally also how these operating systems spell filenames, so you can just hand the OsString to an OS-level file API, to open it, rename it or whatever without caring if it is Unicode or not.

The U+FFFD option does expose one potential surprise on Unix flavour systems which doesn't exist on Windows because U+FFFD is one UTF-16 code unit, but it is three UTF-8 code units, so such a "decoded" input could exceed the size of some internal storage you'd assumed would never need to be larger than the command line maximum length.