Hacker News
by mort96 54 days ago
UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the Latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.
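To put a rough number on that, here is a minimal Python sketch that measures UTF-8 bytes per code point. The sample sentences and the helper name `utf8_overhead` are my own picks for illustration, not anything from the thread.

```python
# Sample sentences are illustrative choices (assumptions), not from the thread.
samples = {
    "English": "The quick brown fox jumps over the lazy dog",
    "German": "Zwölf Boxkämpfer jagen Viktor quer über den großen Sylter Deich",
    "Norwegian": "Vår sære Zulu fra badeøya spilte jo whist og quickstep i min taxi",
    "French": "Portez ce vieux whisky au juge blond qui fume",
}


def utf8_overhead(text: str) -> float:
    """Return the number of UTF-8 bytes per code point for the given text."""
    return len(text.encode("utf-8")) / len(text)


for language, text in samples.items():
    print(
        f"{language}: {len(text)} code points, "
        f"{len(text.encode('utf-8'))} bytes, "
        f"{utf8_overhead(text):.3f} bytes per code point"
    )
```

For text like this the ratio stays close to 1, because only the few letters outside ASCII cost a second byte.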

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?
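A quick way to see the UTF-16 trade-off is to encode the same strings both ways. This is only a sketch: the sample strings and the helper `encoded_sizes` are assumptions for illustration, and it uses `utf-16-le` so no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_text = "plain ASCII text"  # 16 ASCII characters
han_text = "日本語の文章"          # 6 characters, sample Japanese text

print(encoded_sizes(ascii_text))  # {'utf-8': 16, 'utf-16': 32}
print(encoded_sizes(han_text))    # {'utf-8': 18, 'utf-16': 12}
```

ASCII doubles in size under UTF-16, while these Japanese characters drop from 3 bytes each to 2.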

2 comments

> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.

> Most European languages use variations of the Latin alphabet

If you want to interpret "variations of Latin" really, really loosely, that's true.

Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.
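For concreteness, a small Python sketch of the size difference being described. The sample strings, the helper name `compare`, and the choice of `cp1251` and `iso8859_7` as the single-byte code pages are assumptions for illustration.

```python
def compare(text: str, codepage: str) -> tuple[int, int]:
    """Return (bytes in the single-byte code page, bytes in UTF-8)."""
    return len(text.encode(codepage)), len(text.encode("utf-8"))


cyrillic = "Привет, мир"      # 9 Cyrillic letters plus a comma and a space
greek = "ΚΑΛΗΜΕΡΑ ΚΟΣΜΕ"      # 13 Greek letters plus a space

print(compare(cyrillic, "cp1251"))   # (11, 20)
print(compare(greek, "iso8859_7"))   # (14, 27)
```

Each Cyrillic or Greek letter takes two bytes in UTF-8 and one in the code page, while the space and comma take one byte either way.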

As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...

Maybe if you're one of those AI behemoths who works with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks which all fit in one byte in UTF-8).

On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.

It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.

I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.
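A rough Python sketch of why the guessing had to be done by eye. The helper `decode_candidates` and the sample word are assumptions for illustration. The point is that single-byte code pages will happily decode each other's bytes into the wrong letters, so software alone can't tell which result is right.

```python
def decode_candidates(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Try each encoding strictly and keep the ones that decode without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes outright
    return results


original = "Привет"
data = original.encode("koi8_r")  # pretend the file arrived with no label

results = decode_candidates(data, ["koi8_r", "cp1251", "cp866", "utf-8"])
for name, text in results.items():
    marker = "correct" if text == original else "mojibake"
    print(f"{name}: {text!r} ({marker})")
```

In this run the three code pages all decode without complaint and only one of them yields the intended word, while the strict UTF-8 decoder rejects the bytes instead of showing wrong letters.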

> A file isn't meaningful unless you know how to interpret it; that will always be true.

There are multiple levels of meaning, though; character encoding is just one part of it. For example, a text file might be plain text, or HTML, or JSON, or C source code, etc.; a binary file might be DER, or IFF, or ZIP, etc.; and then there will be e.g. what kind of data a JSON or DER or IFF file contains and how that level of the data is interpreted, etc.

> Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters.

Whether or not they are identical to ASCII characters depends on the character set and on other things, such as what they are being used for; the definition of "identical" is not so simple as you make it seem. Unicode defines them as not identical, which is appropriate for some uses but is wrong for other uses. (Unicode also defines some characters as identical even though in some uses it would be more appropriate to treat them as not identical, too. So Unicode is bad both ways.)

> This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

I agree with that (although I think UTF-8 should not be used for Japanese either), but it isn't because of which characters are considered "identical" or not. There are problems with Unicode in general regardless of which encoding you use.

> ... (although I think UTF-8 should not be used for Japanese either) ...

The people putting up websites in Japanese disagree with you, it would seem. According to Wikipedia (in the Shift JIS article), as of March 2026, 99% of websites in the .jp domain were in UTF-8, with only 1% being in Shift JIS.

Japan used to have two different encodings in common use, Shift JIS (usually used on Windows) and EUC-JP (more common on Unix servers). This resulted in characters being misinterpreted often enough that they coined the word mojibake to describe the phenomenon of text coming out completely garbled. These days, it seems Japanese website makers are more than happy to accept a slight inefficiency in encoding size, because what they gain from that is never having to see mojibake again.

If they are misinterpreted, it is because the character encoding is not declared properly.

I still sometimes see mojibake in Japanese web pages, but sometimes it works; if it works, it is because the character encoding is declared properly.

In my opinion, EUC-JP is a generally better encoding of JIS (especially in e.g. C source code, which should not use Shift-JIS but EUC-JP is OK), but Shift-JIS does have some benefits in some circumstances (such as making a character grid with one byte per character cell; if using Shift-JIS for Pascal source code then you should use (* *) instead of { } for comments please).

> If they are misinterpreted, it is because the character encoding is not declared properly.

OR because the software is buggy, or making assumptions about encoding and not checking them (which also counts as "buggy", of course). You can declare the encoding all you like, it won't protect you against the stupid decisions that other people make in writing their software. (See Excel, for example).

Yes, if you declare your encoding properly, things should work. Most of the time. And if you're using any encoding that is not the worldwide default (which these days is UTF-8), then you definitely should declare the encoding. But you'll still occasionally hit badly-written software that doesn't even think about other encodings and doesn't handle them properly. The only defense against that situation, where you declare your encoding properly and it still doesn't work, is to just use the encoding that the software was written to expect, which is almost certainly the worldwide default.

UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.

The byte order mark has no relation to code pages.
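A short Python sketch of the difference, using only the standard `codecs` module. The sample string and variable names are assumptions for illustration.

```python
import codecs

text = "hi"

# UTF-16 has two possible byte orders, so the same text has two serializations.
utf16_le = text.encode("utf-16-le")   # b'h\x00i\x00'
utf16_be = text.encode("utf-16-be")   # b'\x00h\x00i'
print(utf16_le, utf16_be)

# A leading BOM tells the generic "utf-16" decoder which order was used.
print((codecs.BOM_UTF16_LE + utf16_le).decode("utf-16"))  # hi
print((codecs.BOM_UTF16_BE + utf16_be).decode("utf-16"))  # hi

# UTF-8 is defined as a byte sequence, so there is only one serialization
# and nothing for a BOM to disambiguate.
utf8 = text.encode("utf-8")           # b'hi'
print(utf8)

# Python's "utf-8-sig" codec writes the optional 3-byte signature;
# it marks the data as UTF-8 but carries no byte order information.
utf8_sig = text.encode("utf-8-sig")
print(utf8_sig == codecs.BOM_UTF8 + utf8)  # True
```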

I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.

EDIT: okay, since you edited your comment to add the part about Greek and Cyrillic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and Cyrillic are exceptions. That does not invalidate my point.

Unicode could have just been encoded statefully with a "current code page" mark byte.

With UTF and emojis we can't have random access to characters anyways, so why not go the whole way?

Yikes. That would lose the ability to know the meaning of the current bytes, or misinterpret them badly, if you happen to get one critical byte dropped or mangled in transmission. At least UTF-8 is self-syncing: if you end up starting to read in the middle of a non-rewindable stream whose beginning has already passed, you can identify the start of the next valid codepoint sequence unambiguously, and then end up being able to sync up with the stream, and you're guaranteed not to have to read more than 4 bytes (6 bytes when UTF-8 was originally designed) in order to find a sync point.

But if you have to rely on a byte that may have already gone past? No way to pick up in the middle of a stream and know what went before.
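The self-syncing property is easy to sketch in Python: continuation bytes always look like `10xxxxxx`, so you skip them until you hit a byte that starts a code point. The function names and the sample string below are assumptions for illustration.

```python
def next_sync_point(data: bytes, offset: int) -> int:
    """Return the index of the first byte at or after offset that is not a
    UTF-8 continuation byte (continuation bytes match the pattern 10xxxxxx)."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


def resume_from(data: bytes, offset: int) -> str:
    """Decode data starting from the first code point boundary at or after offset."""
    return data[next_sync_point(data, offset):].decode("utf-8")


text = "naïve café, 5 € and 😀"
stream = text.encode("utf-8")

# Jump into the stream at every possible byte offset and resume decoding.
for start in range(len(stream)):
    skipped = next_sync_point(stream, start) - start
    assert skipped <= 3              # at most 3 continuation bytes to skip
    assert text.endswith(resume_from(stream, start))

print(resume_from(stream, 3))
```

No earlier state is needed: the decision is made from the bytes at hand, which is exactly what a stateful "current code page" byte would take away.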

We've already lost all that with emojis and other characters in supplementary planes.

No, we haven't. You can start at any byte in a UTF-8 document and resume reading coherent text. If you start reading from the middle of a multi code point sequence, then the first couple of glyphs may be wrong, for example you may see a lone skin tone modifier rendered as a beige blob where the author intended a smiley face with that skin tone. But these multi code point sequences are short, and the garbled text is bounded to the rest of the multi code point sequence. The entire rest of the document will be perfectly readable.

Compare this to missing a code page indicator. It will garble the whole section until the next code page indicator, often the whole rest of the document. The fact that you're even comparing these two situations as if they're the same is frankly ridiculous.
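Here is a small Python sketch of the bounded damage, assuming a waving hand emoji followed by a skin tone modifier as the multi code point sequence. The sample string and the offset are my own choices for illustration.

```python
import unicodedata

# "Hello " + WAVING HAND SIGN (U+1F44B) + skin tone modifier (U+1F3FD) + " world"
text = "Hello \U0001F44B\U0001F3FD world"
stream = text.encode("utf-8")

# Start reading at byte 8, which is in the middle of the 4-byte waving hand.
# errors="replace" turns each undecodable byte into U+FFFD instead of raising.
tail = stream[8:].decode("utf-8", errors="replace")

for ch in tail[:4]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))

# Dropping the replacement characters leaves a lone modifier, then intact text.
cleaned = tail.replace("\ufffd", "")
print(cleaned == "\U0001F3FD world")  # True
```

The damage stops at the end of the emoji sequence, and " world" comes through untouched.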

A huge, central part of UTF-8's design is that you can start decoding it from any arbitrary offset; it is self-aligning.

Unicode had support for language tag codepoints. They still exist but have long been deprecated. They were intended to deal with glyph variants, especially with regard to Han unification.