by russellallen 2285 days ago
> Unicode 13.0 adds 5,930 characters, for a total of 143,859 characters. These additions include 4 new scripts, for a total of 154 scripts, as well as 55 new emoji characters.

So how far off is Unicode from being 'done'? At what point will they be able to stop adding characters and scripts?

6 comments

German, a European language that has been more or less standardized for several centuries, with a Latin-based alphabet, added a new letter (ẞ) to its alphabet in 2017. As long as that continues to happen, Unicode will have to add new characters, even if no more ancient scripts are discovered and no new writing systems are developed for currently unwritten languages.
https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

That letter, the capital eszett, has existed in German typefaces since at least 1905:

> Historical typefaces offering a capitalized eszett mostly date to the time between 1905 and 1930. The first known typefaces to include capital eszett were produced by the Schelter & Giesecke foundry in Leipzig, in 1905/06. Schelter & Giesecke at the time widely advocated the use of this type, but its use remained very limited.

Eszett is usually just a lower-case form; it is most often upper-cased to SS (two capital esses), which is an example of how changing case does not always preserve the number of letters in a text. Capital eszett is very rare, but it was in occasional use in German text, and so it was added to Unicode.
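
The length change is easy to see in Python, whose `str.upper()` applies the full Unicode case mappings. A minimal sketch; the sample word is only an illustration:

```python
# Upper-casing the lowercase eszett (U+00DF) expands it to two letters,
# so the upper-cased string is longer than the original.
word = "stra\u00dfe"          # "straße"
upper = word.upper()          # full case mapping: ß -> SS

print(upper, len(word), len(upper))

# The capital eszett (U+1E9E) lower-cases to the ordinary eszett,
# but the ordinary eszett does not upper-case to it by default.
capital_eszett = "\u1e9e"
print(capital_eszett.lower() == "\u00df")
print("\u00df".upper() == capital_eszett)
```

So a round trip through `upper()` and `lower()` turns "straße" into "strasse", which is one reason case conversion is not reversible in general.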

A small clarification for those who don't know the context: the Eszett (which comes from the ligature of 'ss') has existed for centuries in German writing. 2017 is just the date of its official integration into the alphabet, so it's not a 'new' letter created from scratch. I say that because I remember learning it at school a few decades ago (even if at the time we were warned the subject was touchy), and I was surprised it wasn't standardized earlier.
The Eszett (ß) has already been standardized for several decades (since 1986: with ISO 8859-1 aka Latin-1).

The newly added letter is the "capital letter Eszett", which did not exist until recently. One could argue that this new letter is not really needed, as Eszett does not appear in capitalized form except when a word is in all-caps, and was then simply written as "SS".

The capital version has existed in uncommon use since the early 1900s.
Standardisation was earlier than that :-) ISO-646 is the 7-bit predecessor to the 8-bit ISO-8859. It dates back to the late 1960s. Like 8859, 646 has several variants, such as ISO-646-DE which has ß where ~ is in ascii. (The trigraphs in C are partly to work around transcoding issues between ISO-646 and EBCDIC variants.)
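
To make the ISO-646-DE point concrete, here is a minimal sketch of decoding the German national variant (DIN 66003), which reuses eight ASCII code positions for §, the umlauts and ß. The function name is made up for illustration; Python ships no codec for this variant as far as I know, so the sketch uses a translation table:

```python
# DIN 66003 keeps the 7-bit layout of ASCII but assigns different
# characters to eight positions, e.g. 0x7E is ß instead of ~.
DIN_66003 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(data: bytes) -> str:
    """Decode 7-bit bytes as DIN 66003 text (hypothetical helper)."""
    return data.decode("ascii").translate(DIN_66003)


print(decode_din_66003(b"Stra~e"))   # the byte for '~' shows up as ß
print(decode_din_66003(b"[rger"))    # the byte for '[' shows up as Ä
```

This is also why C source typed on such a terminal had trouble with `~`, `[`, `{` and friends: those byte values displayed as letters.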
Thanks for info, I didn't know that.
And some more precision: this is about the uppercase Eszett. The lowercase variant has existed for ages, and before 2017 was officially uppercased into SS or SZ.
It still is officially capitalized to SS under normal circumstances; capital ẞ is an allowed alternative: "Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch die Verwendung des Großbuchstabens ẞ möglich." (Deutsche Rechtschreibung § 25 E 3; in English: "When writing in capital letters, one writes SS. Use of the capital letter ẞ is also possible.")

SZ hasn't officially been an option at least since 1996.

> Eszett (which comes from the ligature of 'ss')

ß is both visually and from its name (‘s’ ‘z’) a ligature of s and z (“Eszett” in English is “ess-zed”). An ss ligature would look like a double integral sign. “Sz” seems like a better way to represent that sound, so I don’t know why it morphed into “ss”.

Greek has two characters for s, one for use in the middle and one for the ends of words. English lost this in the late 18th or early 19th century (look in the Declaration of Independence for examples). German kept it longer, at least in Fraktur, which included other standard ligatures, even in handwritten text, such as ch, tz et al. The Umlaut mark can also be considered a ligature for E which is how it was originally drawn.

This is about capital ẞ. Small ß must've been in Unicode from the start.
> So how far off is Unicode from being 'done'?

Do you think human written language is ‘done’ and will never evolve?

I think that most of the changes in Unicode 13 are not from the evolution of human written language. I don't know anyone who's ever written "blueberries" by drawing a picture of some blueberries in the middle of their text.
From the perspective of a person who uses an alphabetical language, such as English, sure Unicode can be "done". But if your language is based on ideograms, like Chinese, then it'll never be "done". As words are created they need to be encoded.
Again, that's great and I understand that (I've studied Japanese), but that's only part of the new version. They're not adding pictures of "mousetrap" and "olives" and "toilet plunger" because any existing language needs to write these.

Furthermore, I'm really starting to question the way CJK is encoded. We don't make every English word a separate codepoint. 97% of these CJK ideographs are just different combinations of the same few radicals. Korean seems especially weird, as they have both individual radicals and every precomposed triple (in a block that's been rearranged once or twice, on the basis that nobody was really using it yet). I'm not saying we should nix all precomposed Hanzi/Kanji, exactly, because that's a very convenient way for programs to handle text, but it seems like this system is becoming increasingly awkward for non-western languages.
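
For the Korean case specifically, the precomposed block is at least fully algorithmic: every syllable's code point is computed from the indices of its leading consonant, vowel and optional trailing consonant. A minimal sketch of that arithmetic (the function name is mine; the constants are the ones from the Unicode Standard's Hangul composition algorithm):

```python
import unicodedata

# 19 leading consonants x 21 vowels x 28 trailing slots (0 = no trailing
# consonant) = 11,172 precomposed syllables starting at U+AC00.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28


def compose_hangul(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Return the precomposed syllable for the given jamo indices."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


syllable = compose_hangul(18, 0, 4)                    # HIEUH + A + NIEUN
decomposed = unicodedata.normalize("NFD", syllable)    # three conjoining jamo
recomposed = unicodedata.normalize("NFC", decomposed)  # back to one code point

print(syllable, [hex(ord(c)) for c in decomposed], recomposed == syllable)
```

Normalization already treats the two forms as equivalent, so the precomposed block is a convenience rather than extra information. Nothing comparable exists for Han ideographs, where the components do not combine by a fixed rule.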

I feel there's a fundamental flaw when our "universal" text encoding system can't handle the regular creation of new words in a well-understood way, for languages spoken by 1/3rd of the world's population. It's like we're issuing hardware patches for a software problem.

It is. I do not like the way CJK is being dealt with. Not to mention that fonts don't include all the CJK variants of a character, so when I use the same word but need the J or K variant, because that is how it is supposed to be written, I can't get it.

Even the "C" has traditional and simplified variants.

Fortunately I think Unicode is pretty much done for alphabetical languages. Someday, if Unicode's CJK design isn't good enough, breaking it off into something better isn't entirely impossible.

Emojis literally are an evolution in human written language. They started with youth texting and are now showing up in business emails. I predict that within 50 years we'll see emojis as a routine component of New York Times articles.
Now you're getting into the definition of "writing". I would say I've only seen emoji typed, not written. (Before anyone asks: yes, I've seen cuneiform written. I have some interesting friends.) If you count any visual communication that is typed on a phone under the greater umbrella of "writing", then we could also include colors, styles, orientation, funny fonts, image memes, animation, etc. There's no end to the possible visual communication that people might want to transmit digitally.

Where do you draw the line? I draw it at "anything in or using a language that people might write in the absence of computers, which they would then reasonably want to store and transmit using a computer". I don't include "any possible visual communication that can occur using a computer". That's far too broad to define "text", or be part of any existing "language", which are the stated goals of Unicode.

I'm not going to hold my breath on this one. Emojis have an air of informality that is not appropriate in many circumstances. Imagine writing a death notice with emojis.
Well one could use U+1303F.

But you wait until Maya script gets into Unicode. You'll have at least three different Maya codepoints for death. (-:

I believe all scripts that are in extant use today are complete; if not, the missing extant scripts are down to the scripts with very few literate people (thousands or fewer).

Many of the additions today, barring emoji, are covering historical usage. This includes things like Medieval scribal annotations, a different set of numbers for the Ottomans, and the Mayan script. It will still be over a decade for the historical work to be complete, since there is often a lot of actual research that needs to be done to understand how an ancient writing system works, which has to come before you can even put together a coherent proposal for a new script.

I think that's a good framework to think about Unicode being "done". But even extant scripts aren't done; consider the Bopomofo additions here in Unicode 13. And it's not clear what "done" even means for Chinese characters.
Until there's a kumquat emoji then it will not be done:

* https://www.emojis.com/food/fruit/

There will always be another Emoji that someone, somewhere wants to add.

Well, it was a can of worms once they took the existing characters which for the most part were culturally very tied to Japan. Now everyone had access to those fun little icons and in turn a lot of people felt concepts from their cultural surrounding underrepresented.

And while every Unicode announcement gets derided because they added more emoji, it's still just a small subset of the standard. Back when they added them I wasn't much of a fan, but by now I think it was the right decision. I still don't use them, but I've heard they're quite popular in younger age groups.

The Script Encoding Initiative [0] is a UC Berkeley project to add Unicode support for uncommon and historical scripts. They have a list of remaining scripts which is encouragingly short [1], and most of the scripts on that list have proposals in progress, so in theory they should be adopted soon.

[0]: https://linguistics.berkeley.edu/sei/index.html
[1]: https://linguistics.berkeley.edu/sei/scripts-not-encoded.htm...

Once people stop needing and/or inventing new characters and scripts.
"214 graphic characters that provide compatibility with various home computers from the mid-1970s to the mid-1980s and with early teletext broadcasting standards"

This part is dear to me, as I helped craft it. It includes 2x3 videotext mosaic characters that will make it much easier to draw large text and have better-quality charts in text terminal interfaces.
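
As a rough illustration of how a terminal program might use the 2x3 mosaics, here is a sketch that packs a pixel grid into them. It assumes the sextants start at U+1FB00 and are ordered by bit pattern (top-left cell as the lowest bit), with the empty, left-half, right-half and full patterns left out because they were already encoded elsewhere. That layout is my reading of the code chart, not something stated in this thread, so check it against the chart before relying on it. The function names are made up.

```python
def sextant(bits: int) -> str:
    """Character for a 2x3 cell; bit 0 = top-left ... bit 5 = bottom-right."""
    if not 0 <= bits < 64:
        raise ValueError("expected a 6-bit pattern")
    # Patterns assumed to reuse pre-existing characters: space, left half
    # block, right half block, full block.
    existing = {0: " ", 21: "\u258c", 42: "\u2590", 63: "\u2588"}
    if bits in existing:
        return existing[bits]
    # Skip the slots that the left-half (21) and right-half (42) patterns
    # would have occupied in a plain binary ordering.
    return chr(0x1FB00 + bits - 1 - (bits > 21) - (bits > 42))


def render(pixels: list[str]) -> list[str]:
    """Pack rows of '#'/'.' pixels into lines of sextant characters."""
    height = len(pixels)
    width = max((len(row) for row in pixels), default=0)

    def lit(x: int, y: int) -> bool:
        # Pixels beyond a ragged edge count as unlit.
        return y < height and x < len(pixels[y]) and pixels[y][x] == "#"

    lines = []
    for y in range(0, height, 3):
        line = ""
        for x in range(0, width, 2):
            bits = sum(1 << (dy * 2 + dx)
                       for dy in range(3) for dx in range(2)
                       if lit(x + dx, y + dy))
            line += sextant(bits)
        lines.append(line)
    return lines


print("\n".join(render(["#..#", ".##.", "#..#"])))
```

Each output character covers six pixels, so a chart drawn this way gets twice the horizontal and three times the vertical resolution of one drawn with full blocks.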

And, of course, the ability to properly encode documents that were generated in computers in the 70's and 80's that contained those platform-specific characters.

For 14 we are planning on adding symbols from the Sharp MZ series and the large text characters (3x3 cells) of HP terminals.

The niche audience of terminal based games will also be thankful forever to you and the rest of the people that got those characters into Unicode.
They might be interested in Unscii, too. It has been updated in light of Unicode 13.

* http://pelulamu.net/unscii/ (https://news.ycombinator.com/item?id=18478350)

Down for me as is everything else.

Found a unicode consortium tweet (!) with a picture of the whole block:

https://twitter.com/unicode/status/1085613123183071232?lang=...

Do these... do these include the old Commodore 64 metacharacters/symbols??
You can read the explanation in the proposal, which will tell you everything, at http://www.unicode.org/L2/L2019/19025-terminals-prop.pdf .
Yes. PETSCII (and ATASCII) were some of the first characters that were added to the proposal (before I arrived at the group)
And emojis, don't forget about the all-important emojis.
Emoji: It's like Kanji, only without agreed upon semantic meaning or pronunciation. I pity historians of the future who have to try and decipher this garbage.
You are probably joking, but I am seriously annoyed there's no donkey emoji.
My wife and I call each other "donkey", and some years ago we used the horse emoji, which was low resolution enough to look like a donkey if you squinted. But modern emojis are too high resolution and it definitely looks like a horse now. So I feel your pain.
I wanted a pink pony.

My wifi SSID is <horse U+1F40E><unicorn U+1F984>, which is a barely satisfactory approximation.

(FWIW the FTP site is still up: ftp://ftp.unicode.org/Public/13.0.0/ )

Is U+130D8 not good enough for you?
If there's room in Unicode to describe "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" there's room for "DONKEY".

I wonder why Gardiner's descriptions never made it into Unicode? https://en.wikipedia.org/wiki/List_of_Egyptian_hieroglyphs#E

Oh, wait, there are censored genitalia in the D block ...

Thanks, I wasn't aware of it. Sadly I get a square when I try to use it, so it's less reliable than an emoji.
either that or when there are 2^24 used codepoints.
The available space is closer to 2^20 (0-10FFFF, minus the surrogate range, depending on whether you are talking about Unicode scalar values or code points).
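
The arithmetic behind that, as a quick sketch. The only figure taken from outside the standard's definitions is the Unicode 13.0 character count quoted in the post:

```python
# Code points run from U+0000 to U+10FFFF: 17 planes of 65,536 each.
CODE_POINTS = 0x10FFFF + 1

# Surrogates (U+D800..U+DFFF) are code points but not scalar values.
SURROGATES = 0xDFFF - 0xD800 + 1
SCALAR_VALUES = CODE_POINTS - SURROGATES

ASSIGNED_IN_13 = 143_859  # total characters per the Unicode 13.0 announcement

print(f"code points:   {CODE_POINTS:,}")
print(f"scalar values: {SCALAR_VALUES:,}")
print(f"2^20 = {2**20:,}, 2^24 = {2**24:,}")
print(f"share assigned: {ASSIGNED_IN_13 / SCALAR_VALUES:.1%}")
```

So the ceiling is a little over 2^20, roughly a sixteenth of 2^24.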
There are also emoji modifiers (https://en.wikipedia.org/wiki/Miscellaneous_Symbols_and_Pict...) and regional indicators (https://en.wikipedia.org/wiki/Regional_Indicator_Symbol), which complicate determining the number of Unicode characters.
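
A small sketch of why these complicate counting: a flag is a pair of regional indicator code points, and a skin-toned emoji is a base emoji followed by a modifier, so one thing on screen is several code points. The helper name is made up, and whether a given pair actually renders as a flag depends on the font and platform:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Build a flag sequence from a two-letter code (hypothetical helper)."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two letters A-Z")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


german_flag = flag("DE")

# Thumbs up (U+1F44D) followed by the medium skin tone modifier (U+1F3FD).
thumbs_up = "\U0001F44D" + "\U0001F3FD"

# Each of these is usually displayed as one symbol, yet is two code points
# and eight bytes in UTF-8.
for text in (german_flag, thumbs_up):
    print(text, len(text), len(text.encode("utf-8")))
```

None of these combinations needs its own code point, so the number of things a user can type grows faster than the character count in each release.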