Hacker News
by heyplanet 2250 days ago
I think UTF-8 was a mistake.

It is a pain in the ass to have a variable number of bytes per char.

In Ascii, you could easily know every character personally. No strange surprises.

Also no surprises while reading black on white text and suddenly being confronted with clors [1].

[1] Also no surprises when writing a comment on HN like this one and having some characters stripped. I put in a smiley as the first "o" in colors, but it was stripped out. Looks like the makers of HN don't like UTF-8 either.

6 comments

You're conflating code points with some encoding; more importantly, you're conflating "array of encoded objects (bytes)" with "a string of text". They're not, and never have been, the same.
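To make the distinction concrete, here is a minimal Python sketch. The sample word is my own choice, not something from the thread. It shows the same text viewed as code points (`str`) and as encoded bytes (`bytes`), and what happens when those bytes are read under the wrong encoding.

```python
text = "na\u00efve"           # 5 code points; the third is U+00EF (i with diaeresis)
data = text.encode("utf-8")   # 6 bytes; U+00EF becomes the two bytes C3 AF

print(len(text), len(data))   # 5 6
print(text[2] == "\u00ef")    # True: indexing a str yields a code point
print(hex(data[2]))           # 0xc3: indexing bytes yields a raw byte

# The bytes only become text once an encoding is chosen. Pick the wrong one
# and the same six bytes decode to different (garbled) text.
misread = data.decode("latin-1")
print(len(misread))           # 6
```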
> It is a pain in the ass to have a variable number of bytes per char.

Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.
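A quick way to see the size tradeoff for mostly-ASCII text, as a rough Python sketch. The sample sentence is arbitrary and the ratios only hold exactly for pure-ASCII input.

```python
sample = "The quick brown fox jumps over the lazy dog."

# The "-le" variants are used so that no byte order mark is counted.
sizes = {enc: len(sample.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for enc, size in sizes.items():
    print(f"{enc:10} {size:4} bytes  ({size / len(sample):.0f}x the character count)")
```

For pure-ASCII input UTF-8 stays at one byte per character, while UTF-16 doubles and UTF-32 quadruples the size.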

And as the article points out, even then you might have more than one code point for a character.

> For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
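The quoted example can be checked with Python's standard `unicodedata` module. This is a small sketch; the Latin "e with acute" comparison is my own addition for contrast.

```python
import unicodedata

yu_acute = "\u044e\u0301"   # CYRILLIC SMALL LETTER YU + COMBINING ACUTE ACCENT
e_acute = "e\u0301"         # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

yu_nfc = unicodedata.normalize("NFC", yu_acute)
e_nfc = unicodedata.normalize("NFC", e_acute)

print(len(yu_acute))   # 2: two code points, one visible character
print(len(yu_nfc))     # 2: no precomposed form exists, so NFC cannot shorten it
print(len(e_nfc))      # 1: composes to U+00E9
```

So even a fixed-width encoding like UTF-32 does not give you one unit per visible character.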

You can't even write proper English in ASCII. ASCII is an absolute dead end. It's history.

Actually representing human language is HARD. It is also absolutely necessary. Whatever solution you choose is going to be complicated, because it is solving a very complicated problem.

Throwing your hands up and going "oh this is too hard, I don't like it" will get you nowhere.

You can't write proper snooty English in ASCII, with diaereses and whatnot.
1967 ASCII anticipated that, with dual-use character shapes so you could type o BS " → ö

But then people invented video terminals that didn't overstrike.

ASCII doesn't have all the punctuation regularly used in English.
ASCII doesn't have a direct representation of all the punctuation used in English print, like 66 99 quotes, and different kinds of dashes (distinct from minus). For non-print, it's entirely fine.

Typesetting should be handled by a markup language anyway. Adding a few characters to Notepad doesn't create a typesetting system. A typesetting system needs to be able to do kerning, ligatures, justification. Not to mention bold, italics, and different fonts.

Why would print be different here? A screen is as much "print" as a paper is these days.

Choosing correct punctuation is not typesetting, either.

> It is a pain in the ass to have a variable number of bytes per char.

This is from API & language mistakes more than an issue with UTF-8 itself.

If you actually design your API & system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules, and still gives you things like a simple character iterator (with characters being 32-bit, so that it actually fits: https://doc.rust-lang.org/std/char/index.html). The String type handles all the multi-byte stuff for you, you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html

Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.

Character (code point) iterators are useless.

For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is not a subsequence of the encoding for any other character or sequence of other characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
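A small Python sketch of that property. The field values and the choice of U+2192 as a separator are made up for illustration. Splitting the raw UTF-8 bytes on the encoded separator gives the same fields as splitting the decoded string.

```python
sep = "\u2192"   # RIGHTWARDS ARROW, three bytes in UTF-8
text = "alpha" + sep + "beta" + sep + "\u0433\u0430\u043c\u043c\u0430"

by_code_points = text.split(sep)
by_bytes = text.encode("utf-8").split(sep.encode("utf-8"))

# Decoding each byte-level field recovers exactly the code-point-level fields.
decoded = [field.decode("utf-8") for field in by_bytes]
print(decoded == by_code_points)   # True
```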

And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points - so having these be made up of a variable number of bytes doesn't make anything worse.
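To illustrate why code points are the wrong unit for editing, here is a deliberately rough Python sketch. `rough_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character. Real grapheme segmentation follows UAX #29 and covers much more (emoji sequences, Hangul jamo, and so on), and the standard library does not provide it.

```python
import unicodedata

def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Approximation only: this is not a UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = "\u044e\u0301\u043b\u0430"   # 4 code points, 3 user-visible characters

clusters = rough_clusters(text)
naive_reverse = text[::-1]                      # the accent lands on the wrong letter
cluster_reverse = "".join(reversed(clusters))   # the accent stays with its base

print(len(text), len(clusters))   # 4 3
```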

> It is a pain in the ass to have a variable number of bytes per char.

In the same vein it's a pain in the ass to write everything in assembler. Which is why we don't do that, we use high-level languages instead.

Certain things such as DNS, email addresses and so on should be restricted to ASCII, it’s a security nightmare otherwise.
I assume you mean a limited subset of 7-bit ASCII? 33-126

    % host -t a $'\015'.
    1 \015:
    19 bytes, 1+0+0+0 records, response, authoritative, nxdomain
    query: 1 \015
    %
It's not as straightforward or sensible as you think. It's case insensitive; it's case preserving; and C0 control characters, SPC, and DEL are allowed. The case differentiating bits for letters are nowadays sometimes used in an attempt to foil attackers. If you want things to look back on and say "I think that X was a mistake." then forget UTF of any stripe. The DNS is full of them.
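A rough Python sketch of the case behaviour described above. The function names are hypothetical and this is not taken from any resolver's code. Comparison folds only the ASCII letters, every other octet must match exactly, and the 0x20 bit of each letter can be flipped at random without changing which name is meant.

```python
import random

def fold_ascii(label: bytes) -> bytes:
    """Lowercase only the bytes A-Z; every other octet is left untouched."""
    return bytes(b | 0x20 if 0x41 <= b <= 0x5A else b for b in label)

def labels_equal(a: bytes, b: bytes) -> bool:
    """Case-insensitive comparison over ASCII letters only."""
    return fold_ascii(a) == fold_ascii(b)

def randomize_case(label: bytes, rng: random.Random) -> bytes:
    """Flip the 0x20 case bit of each ASCII letter with probability 1/2."""
    out = bytearray(label)
    for i, b in enumerate(out):
        is_letter = 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A
        if is_letter and rng.random() < 0.5:
            out[i] = b ^ 0x20
    return bytes(out)

mixed = randomize_case(b"example", random.Random())
print(mixed, labels_equal(mixed, b"example"))   # random casing, always True
```

A client that sends the randomized spelling can check that the reply echoes it back byte for byte, which is where the case preservation comes in.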
I thought DNS allowed any arbitrary byte sequence as a label (up to the maximum length limit).
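For reference, a minimal Python sketch of how a name is laid out on the wire, assuming the RFC 1035 format: each label is a length octet followed by that many octets, the name ends with a zero-length root label, labels are limited to 63 octets and the whole name to 255. `encode_name` is a hypothetical helper, not part of any library. Nothing in this layout restricts which byte values a label contains.

```python
def encode_name(labels):
    """Encode a list of byte-string labels as an uncompressed DNS name."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 octets")
        out.append(len(label))   # length octet
        out += label             # raw label octets, any byte values
    out.append(0)                # zero-length root label terminates the name
    if len(out) > 255:
        raise ValueError("encoded name exceeds 255 octets")
    return bytes(out)

print(encode_name([b"www", b"example", b"com"]))   # b'\x03www\x07example\x03com\x00'
print(encode_name([b"\r"]))                        # b'\x01\r\x00'
```

The second call mirrors the single carriage-return label queried with `host` above.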