Hacker News new | ask | show | jobs
by carra 58 days ago
I read that article long time ago, and for me it's a hard disagree. A system as complex and quirky as Unicode can never be considered "plain", and even today it is common for many apps that something Unicode-related breaks. ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.

2 comments

I feel like that isn’t exactly a very useful definition of plaintext. If you mean “ASCII” say ASCII.

Plain text is text intended to be interpreted as bytes that map simply to characters. Complexity is irrelevant.

https://en.wikipedia.org/wiki/Plaintext

  With the advent of computing, the term plaintext expanded beyond human-readable documents to mean any data, including binary files, in a form that can be viewed or used without requiring a key or other decryption device. Information—a message, document, file, etc.—if to be communicated or stored in an unencrypted form is referred to as plaintext.
https://csrc.nist.gov/glossary/term/plaintext

    Unencrypted information that may be input to an encryption operation. Note: Plain text is not a synonym for clear text. See clear text.

    Intelligible data that has meaning and can be understood without the application of decryption.
Unfortunately no, Unicode is not simply a mapping of bytes to characters. It is a mapping of numbers to code points, and in some cases you can even get the same characters with multiple code point sequences (not a very good mapping!). Then you need to convert numbers to bytes, so aside from Unicode you also need an encoding. And there are multiple choices. So what would be "plain text" then? UTF-16? UTF-8? If so, with or without BOM? It can't be all of them. For something to really be "plain text" it has to be the same thing to everyone...
> Unfortunately no, Unicode is not simply a mapping of bytes to characters. It is a mapping of numbers to code points, and in some cases you can even get the same characters with multiple code point sequences (not a very good mapping!).

It is worse than that; you can also get different characters with the same code points, and also same code points and characters that should be different according to some uses, and also different code points and characters that should be same according to some uses, etc.

I agree with you that Unicode is too complicated and messy, although it also shows that whether or not something is considered "plain" is itself too difficult.

Unicode has caused many problems (although it was common for m17n and i18n to be not working well before Unicode either). One problem is making some programs no longer 8-bit clean.

Unicode might be considered in two ways: (1) Unicode is an approximation of multiple other character sets, (2) All character sets are an encoding of a subset of Unicode. At best, if Unicode is used at all, it should be used as (1) (as a last resort), but it is too common for Unicode to be used as (2) (as a first resort), which is not good in my opinion.

(I mostly avoid Unicode in my software, although it is also often the case (and, in many (but not all) programs, should be the case) that it only cares about ASCII but does not prevent you from using any other character encodings that are compatible with ASCII.)

> ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

Yes, it does work well (almost) everywhere.

Supersets of ASCII are also common, including UTF-8, and PC character set, ISO 2022 (if ASCII is the initially selected G0 set, which it is in the ASN.1 Graphic string and General string types, as well in most terminal emulators), EUC-JP, etc. In these cases, ASCII will also usually work well.

However, as another comment mentions, and I agree with them, that if you mean "ASCII" then it is what you should say, rather than "plain text" which does not tell you what the character encoding is. That other comment says:

> Plain text is text intended to be interpreted as bytes that map simply to characters.

However, it is not always so clear and simple what "characters" is, depending on the character sets and what language you are writing. And then, there are also control characters, to be considered, so it is again not quite so "plain".

> And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.

In my opinion, it depends on the context and usage. One character set (regardless of which one it is) cannot be suitable for all purposes. However, for many purposes, ASCII is suitable (including C source code; you might put l10n in a separate file).

You should have proper m17n (in the contexts where it is appropriate, which is not necessarily all files), but Unicode is not a good way to do it.

All character setes do not encode subsets of Unicode.

Two well-known counterexamples that come immediately to mind:

1. Mac OS Roman includes a non-Unicode Apple logo.

2. The Atari ST character set includes two non-Unicode characters that combine to create an Atari logo, and 4 non-Unicode characters that combine to create a picture of J.R. "Bob" Dobbs [1].

[1] https://en.wikipedia.org/wiki/J._R._%22Bob%22_Dobbs

Yes, I know that (I was aware of both of these cases, as well as others). In those cases, there are characters that do not correspond to any Unicode characters.

Nevertheless, what I was saying is that many programs seem to be designed as though other character sets do encode subsets of Unicode, but actually they are different character sets and are not Unicode.

However, what I meant was, in addition to things like the examples that you have, other less obvious cases. Even if characters do correspond to Unicode (and sometimes there is more than one way to do it, which is the case for several PC characters), they are not necessarily supposed to work in the same way.

At best, Unicode could be used as an approximation if the other character sets cannot be used, although sometimes there are other ways to do it.