
by carapace 3208 days ago

Step One: Admit there's a problem.

I heard, "Tell me more about what you think would be better." Here goes:

For written languages that are well-served by a simple sequence of symbols (English, etc.) there is no problem: a catalog of the mappings from numbers to pictures is all that is required. Put them in a sequence (anoint UTF-8 as the One True Encoding) and you're good-to-go.

For languages that are NOT well-served by this simple abstraction the first thing to do (assuming you have the requisite breadth and depth of linguistic knowledge) is to figure out simple formal systems that do abstract the languages in question. Then determine equivalence classes and standardize the formal systems.

Let the structure of the language abstraction be a "first-class" entity that has reference implementations. Instead of adding weird modifiers and other dynamic behavior to the code, let them be actual simple DSLs whose output is the proper graphics.

Human languages are like a superset of what computers can represent.

Here's the Unicode Standard[1] on Arabic:

> The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different contextual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it.

They baldly admit that Unicode is not good for drawing Arabic. I find the phrase "the inherent semantic identity of the letter" to be particularly rich. It's nearly mysticism.

If it is inconvenient to try to represent a language in terms of a sequence of symbols, then let's represent it as a (simple) program that renders the language correctly, which allows us to shoehorn non-linear behavior into a sequence of symbols.

If you think about it, this is already what Unicode is doing with modifiers and such. If you read further in the Unicode Standard doc I quoted above you'll see that they basically do create a kind of DSL for dealing with Arabic.

I'm saying: make it explicit.

Don't try to pretend that Unicode is one big standard for human languages. Admit that the "space" of writing systems is way bigger and more involved than Latin et al. Study the problem of representing writing in a computer as a first-class issue. Publish reference implementations of code that can handle each kind of writing system along with the catalog of numbered pictures.

From the Unicode Standard again:

> The Arabic script is cursive, even in its printed form. As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.

Computer systems that are adapted to English are not going to work for Arabic. I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.
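To make the "one letter, several joined forms" point concrete, here is a minimal Python sketch. It assumes only the standard library's `unicodedata` module and the legacy Arabic Presentation Forms-B block (U+FE70 to U+FEFF), whose compatibility decompositions record which form of which letter each code point stands for. The names `build_form_table` and `shape` are made up for illustration. Real shaping is done by fonts and shaping engines; this sketch ignores tashkil, ligatures such as lam-alef, and everything outside that one block.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def build_form_table():
    """Map base letter -> {form name: presentation-form character}."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter entries such as "<initial> 0628"; skip ligatures
        # and unassigned code points (which have an empty decomposition).
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return table

FORMS = build_form_table()

def shape(word):
    """Pick a contextual form for each letter (hypothetical, simplified)."""
    out = []
    for i, ch in enumerate(word):
        forms = FORMS.get(ch)
        if forms is None:            # not a letter this table knows about
            out.append(ch)
            continue
        prev_forms = FORMS.get(word[i - 1], {}) if i > 0 else {}
        next_forms = FORMS.get(word[i + 1], {}) if i + 1 < len(word) else {}
        # A letter with an initial form can connect to the letter after it;
        # a letter with a final form can connect to the letter before it.
        joins_prev = "initial" in prev_forms and "final" in forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next and "medial" in forms:
            name = "medial"
        elif joins_prev:
            name = "final"
        elif joins_next:
            name = "initial"
        else:
            name = "isolated"
        out.append(forms.get(name, ch))
    return "".join(out)

# KAF, TEH, ALEF, BEH: four "semantic" letters, one code point each.
shaped = shape("\u0643\u062a\u0627\u0628")
print([unicodedata.name(c) for c in shaped])
```

For that four-letter word the sketch selects an initial, a medial, a final and an isolated form, because ALEF connects only to the letter before it. The stored string never says any of this; the joining rules live in the code.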

Consider the "Base-4 fractions in Telugu" https://blog.plover.com/math/telugu.html

The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is great! But any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.
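As a rough illustration of the "some code to translate" part, here is a minimal Python sketch that converts between exact fractions and sequences of those eight characters. The assumption, taken from the Unicode character names ("... FOR ODD POWERS OF FOUR", "... FOR EVEN POWERS OF FOUR") and my reading of the linked article, is that the 1/4 place uses the odd-power forms, the 1/16 place uses the even-power forms, and so on alternating. The function names are hypothetical and historical usage may well have differed.

```python
from fractions import Fraction

# Assumption: digit forms chosen according to the Unicode character names.
ODD = "\u0c78\u0c79\u0c7a\u0c7b"    # digits 0-3 for the 1/4, 1/64, ... places
EVEN = "\u0c66\u0c7c\u0c7d\u0c7e"   # digits 0-3 for the 1/16, 1/256, ... places

def to_telugu_fraction(value, places=2):
    """Write 0 <= value < 1 as base-4 fraction digits, alternating forms."""
    value = Fraction(value)
    if not 0 <= value < 1:
        raise ValueError("value must be in [0, 1)")
    out = []
    for place in range(1, places + 1):
        value *= 4
        digit = int(value)          # next base-4 digit
        value -= digit
        out.append((ODD if place % 2 else EVEN)[digit])
    if value:
        raise ValueError("not exactly representable in that many places")
    return "".join(out)

def from_telugu_fraction(text):
    """Inverse of to_telugu_fraction; rejects a form in the wrong place."""
    total = Fraction(0)
    for place, ch in enumerate(text, start=1):
        table = ODD if place % 2 else EVEN
        total += Fraction(table.index(ch), 4 ** place)
    return total

three_sixteenths = to_telugu_fraction(Fraction(3, 16))
print(three_sixteenths, from_telugu_fraction(three_sixteenths))
```

None of this knowledge is carried by the code points themselves. A program that only has the character catalog can display the digits but cannot do arithmetic with them.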

Let that be part of "Unicode" and I'll shut up. In the meantime, I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.

To sum up: I think the thing that replaces Unicode for dealing with human languages in digital form should:

A.) Be created by linguists with help from computer folks, not by computer folks with some nagging from linguists (apologies to the linguist/computer folk who actually did the stuff.)

B.) We should clearly state the problems first: What are the ways that human languages are written down?

C.) Write specific DSLs for each kind of writing. Publish reference implementations.

I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)

[1] http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf

5 comments

simonh 3208 days ago

That's a really good explanation of your position and reasons for it, thank you.

>They baldly admit that Unicode is not good for drawing Arabic.....I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.

Unicode isn't good for drawing anything. Unicode is not intended to encode, nor does it try to encode, how a text should be displayed. At all, even slightly. This is the root of my disagreement with your post. You're claiming it can't accurately render the appearance of text, but that simply isn't its purpose. It is purely and only about encoding the graphemes. Glyphs are what fonts and display technologies like PostScript are for, not Unicode.

You could argue that it should do that, perhaps Unicode should be a vector drawing language or something, but it's hard to see how that would make it useful for text processing that does concern itself with graphemes and grapheme-like units. Unless the display-oriented system you want contained within it a grapheme encoding system like Unicode to facilitate that - but then why not work the other way around and use Unicode for that and build a display system on top of Unicode to address your concerns?

I think trying to have your cake and eat it with a family of distinct DSLs would be problematic. Text processing is bad enough, but how would you process the content of a string that is actually a DSL? With Unicode it's possible to write a library that can process text in any script, even ones not in the standard yet, but if text could consist of computer code in any one of thousands of different domain specific languages, how would you ever be able to write one piece of code to work with all of them and all possible future permutations? Finally if your DSL is producing display output, how does that work with fonts? What if you want to vary the appearance of the output, how do you apply that to the encoding output? It just seems that this approach produces an enormous monolithic super-complex rabbit hole with no bottom in sight.


carapace 3207 days ago

> That's a really good explanation of your position and reasons for it, thank you.

Cheers, I've had time to think and some sleep. I apologize to you and the people I've offended with my cranky trollish manner.

> Unicode is not intended to encode, nor does it try to encode, how a text should be displayed.

This made me realize "text" traditionally is exactly language that is displayed somehow. The whole concept of storing writing as digital bits is metaphysical. Barely so for e.g. English, but quite a lot for e.g. Arabic.

> [Unicode] is purely and only about encoding the graphemes.

If it's just a catalog mapping numbers to little pictures (technically to collections, or families, of glyphs, or even to non-specific heuristics for deciding if a graphical structure counts as a glyph for a grapheme [1]) then I'll shut up. But what about the modifiers and stuff?

Maybe I am being unfair to Unicode. I don't want to deny or denigrate the cool and useful things it actually does do. As I said I think it's a combination of a good idea (encoding graphemes) with an impossible idea (encoding written human languages). If Unicode isn't the latter then I've been shouting at the wrong cloud!

- - - -

Here's what I'm trying to say: Imagine a conceptual "space" with ASCII on one side and PostScript on the other. In between there's a countably infinite set of formalisms that can describe and render human languages. From this point of view, the Unicode standard is a small part of that domain but it is absorbing (in my opinion) so much of the available time and attention that other potentially more-useful regions of the domain are completely neglected.

- - - -

So, yeah, I think we should study languages and writing systems and computerize them carefully with native speakers and writers and linguistic experts in the room. And I think we would need what are in effect DSLs for each kind of writing system. (Not every language, but rather every kind of way that languages are written down.)

> how would you process the content of a string that is actually a DSL

Parse it to a data-structure, the simplest that will suffice for the language's structure. Work with it using defined functions (API). This is what we do already but the fact that English could be represented as array<char> reasonably well tends to obscure it.

    string_value.split()

Or better yet:

    >>> s = "What is the type of text?"
    >>> s.title()
    'What Is The Type Of Text?'

> With Unicode it's possible to write a library that can process text in any script

That seems like it's true but I don't think it is true in practice. In your reply to mjevans elsewhere in this thread, you wrote:

> You can't determine [the correct way of connecting the characters] purely from Unicode, you have to also know the conventions used in writing Arabic script. However Unicode is not intended to encode such conventions.

And you point out that Unicode won't help you properly support cut-and-paste for Arabic. So you can't process text using Unicode if that text is Arabic. In fact, there may not be "text" in Arabic the way there is in English! There is written Arabic but not textual Arabic. In other words, Unicode may well be engaged in creating the textual form of Arabic (and other languages.)

> any one of thousands of different domain specific languages

I think there would be less than a hundred distinct formalisms that together could capture the ways we have come up with to write, perhaps less than a dozen, but I wouldn't want to bet on it.

> how would you ever be able to write one piece of code to work with all of them and all possible future permutations?

Maybe you can't.

But if it's possible it will be by figuring out the type of text, which means exactly to figure out the set of functions that make sense on text. At which point your code can use those functions (the API of the TextType) to abstract over text. Like the str.title() method. Does that even make sense in Chinese or Arabic?
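For what it's worth, here is what Python's own `str.title()` does with a few inputs. This is a small sketch; the sample strings are mine, chosen only for illustration.

```python
samples = {
    "english": "what is the type of text?",
    "apostrophe": "they're bill's friends",
    "chinese": "\u4e2d\u6587",                   # two Han characters
    "arabic": "\u0643\u062a\u0627\u0628",        # KAF TEH ALEF BEH
}

# title() uppercases the first cased character after each non-letter
# and lowercases the rest; scripts without case pass through untouched.
titled = {key: value.title() for key, value in samples.items()}

for key, value in titled.items():
    print(key, repr(value))
```

The Chinese and Arabic strings come back unchanged, so the operation is simply a no-op there, and even the English result is wrong around apostrophes. The method exists on every string, but it only means something for some kinds of text.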

The comment by int_19h in this thread speaks to this point really well:

> It's not about encodings at all, actually. It's about the API that is presented to the programmer.

> And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

> Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.

> It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.
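A minimal sketch of that idea in Python, assuming a hypothetical wrapper class `Text` that defines neither `__len__` nor `__getitem__`, so callers have to say which length they mean. The grapheme count here is a crude approximation (it just skips characters in the Unicode "Mark" categories); proper segmentation follows the Unicode text segmentation rules, which the standard library does not implement.

```python
import unicodedata

class Text:
    """Wraps a str but deliberately offers no len() and no indexing."""

    def __init__(self, value):
        self._value = value

    def length_in_code_points(self):
        return len(self._value)

    def length_in_utf8_bytes(self):
        return len(self._value.encode("utf-8"))

    def length_in_graphemes_approx(self):
        # Crude stand-in: count code points that are not combining marks.
        return sum(
            1 for ch in self._value
            if not unicodedata.category(ch).startswith("M")
        )

t = Text("e\u0301a")   # "e" + COMBINING ACUTE ACCENT + "a"
print(t.length_in_code_points(),
      t.length_in_utf8_bytes(),
      t.length_in_graphemes_approx())
```

Calling `len(t)` raises `TypeError`, which is the point: three different answers are available and the caller has to pick one.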

What you're asking for is the base type for "text" for all languages, the ur-basestring, if you will. (It may not exist.)

> Finally if your DSL is producing display output, how does that work with fonts? What if you want to vary the appearance of the output, how do you apply that to the encoding output? It just seems that this approach produces an enormous monolithic super-complex rabbit hole with no bottom in sight.

Well again, computerized text is a new thing under the sun, different from writing, which has been happening all over the world for thousands of years (cf. Rongorongo[2]). Separating the "text" from the written form of the text (the display) is a new and metaphysical thing to do. For languages like English we get pretty far with encoding the Alphabet and some punctuation marks and putting them in a row. We completely bunted on capitalization though, we pretend that 'a' and 'A' are two different things. Typefaces can be abstracted from the stream of encoded byte/characters and treated as metadata. If you want to include it in a digital document you immediately have to define a DSL (Rich Text Format for example) to shoehorn the metadata back into the byte stream. Complications ensue.

For some languages (e.g. Arabic) it may not make sense to abstract the display of the text from the text. (Again, writing is exactly display. It is literally (no pun intended) the act of displaying language.) You have to include metadata in addition to the graphemes in order to recreate the correct display of the text, so you have to have some kind of DSL for the task.

As I said above, I don't think there are more than one or two dozen truly different ways of writing. A set of DSLs (perhaps not dissimilar to the generative L-Systems that can produce myriad realistic plant-like images from a small set of operations) could presumably model those ways of writing.

Unicode was a start on computerization of written languages. I think an approach that treats each kind of writing system as a first-class object of study in its own right will give us standard models for dealing with text in each kind in digital form. We should strive for computerized writing systems that are "as simple as possible, but no simpler." And, yes, it seems to me that some of them will have to include producing display output.

[1] DuckDuckGo image search for "letter A" https://duckduckgo.com/?q=letter+a&t=ffsb&atb=v60-2_b&iax=1&...

[2] https://en.wikipedia.org/wiki/Rongorongo

- - - -

Here's my "Cartoon History of Unicode":

    1. ASCII exists
    2. Europe does too!  Extend ASCII with the funky umlauts or whatever.
    3. Oh shit! Japan! Mojibake!
    4. I know! Let's use *sixteen* bits!  That'll solve everything.
    5. What do you mean Chinese is different from Japanese?
    6. WTF Arabic!?
    7. Boy there sure are a lot of graphemes.  Gotta collect 'em all.
    8. PIZZA SLICE
    9. POOP

At which point we reach "peak internet" and Doge appears to say "wow".


ubershmekel 3208 days ago

> Unicode is a horrible scam, the worst thing to happen to digital language representation.

> They baldly admit that Unicode is not good for drawing Arabic

> Consider the "Base-4 fractions in Telugu" [...] any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.

Written language is hard to represent, encode and draw. You admit that Unicode and UTF-8 got 2 out of 3 right yet you call it a scam.

Your complaint is a scam and a horrible trollism.


carapace 3208 days ago

When I'm trolling you'll know it. I have a point, I believe it's a good point, and I'm making it.

For languages that can be represented as a sequence of little pictures Unicode is a little better than ASCII. For the rest, it's a scam: We tell people that we have a way of dealing with human languages in computers but it's half-baked, born in ignorance, and all the grotty details are papered over, but you can write PIZZA SLICE or POOP now, so fuck it, ship it.

Represent: "Astral Plane"? What does that have to do with a standard catalog mapping numbers to pictures? I feel Unicode messes that up.

Encoding: UTF-8 is near perfect, 'nuff said. Ken Thompson and Rob Pike doing their thing.

Drawing: Doesn't even begin to touch it really.
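For readers who haven't run into the "astral" planes: they are simply the code points above U+FFFF, which no longer fit in a single 16-bit unit. A small Python sketch, using only the standard library (the helper name `describe` is made up):

```python
import unicodedata

def describe(ch):
    """Return (name, is_astral, UTF-8 byte count, UTF-16 code unit count)."""
    return (
        unicodedata.name(ch),
        ord(ch) > 0xFFFF,                      # outside the 16-bit plane
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")) // 2,      # 2 means a surrogate pair
    )

for ch in ("A", "\u0628", "\U0001F355", "\U0001F4A9"):
    print(describe(ch))
```

UTF-8 handles all four the same way, just with more bytes for the higher code points, while UTF-16 needs two code units (a surrogate pair) for the last two.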

Unicode is a nasty little black hole that's sucking up time and other resources and not really solving the problem.


mixmastamyk 3207 days ago

I think that's too negative. Is Unicode perfect? Of course not, but it's the best we've got for now. Just as Morse, Baudot, or ASCII were the best approximations at one point in time.

It's a hard problem and will take decades for the right solutions/implementations to present themselves. Surely one day there will be an improved successor to Unicode. Things are a lot better than they were even ten years ago, however.


carapace 3207 days ago

Yeah, sorry, I was pretty cranky last night. Please see my reply to simonh in this thread a few minutes ago. (I'm basically agreeing with you.)


ubernostrum 3208 days ago

Speaking for myself, Unicode's original fundamental mistake was one that could only be recognized as a mistake in hindsight: insisting on round-trip compatibility with existing encodings.

Round-trip compatibility meant Unicode had to not only adopt but permanently preserve all the mistakes and inconsistencies of encodings which were popular at the time. Which is how we get a bunch of duplicates, a bunch of code points that are there but only supposed to be used for round-tripping, some of the un-fun edge cases for Latin text where things have both composed and decomposed forms, some of the weirder aspects of equivalence and normalization, etc.

At the time it seemed like a smart and rational thing to do since it meant you could losslessly transition from your existing character set, and then losslessly go back to it if you wanted to, but now that Unicode "won" it's just a source of "well, that's annoying and inconsistent but they needed it for round-tripping" explanations.

In particular, round-trip compatibility meant that Unicode ended up containing a bunch of variant forms of things that existing encodings treated as distinct characters, but which probably would not pass the test of being distinct graphemes by Unicode's definition. Declaring the variant forms to be a contextual issue left up to the font or the rendering system would have been better.
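A few of these leftovers are easy to see from Python's standard `unicodedata` module. This is a small sketch; the specific characters are my own picks for illustration.

```python
import unicodedata

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same text to a reader, different code point sequences until normalized.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed

# ANGSTROM SIGN duplicates LATIN CAPITAL LETTER A WITH RING ABOVE.
angstrom_nfc = unicodedata.normalize("NFC", "\u212b")

# The "fi" ligature is a variant form; only compatibility normalization
# (NFKC) folds it back into two ordinary letters.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")

print(same_before, same_after, hex(ord(angstrom_nfc)), ligature_nfkc)
```

Any code that compares or searches text has to decide which normalization form it wants first, which is part of the cost of carrying those legacy characters.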

Ironically, the second big mistake was to then try to switch philosophies and do just that with the CJK characters, sparking the whole Han unification mess.


mnarayan01 3208 days ago

> I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.

It's more pay-to-play than "cultural imperialism". Arabic does seem to suffer due to no primarily-Arabic country being a member of the consortium (IIRC and it hasn't changed in the last five years). If someone was willing to absorb that cost then they could almost certainly get things done (e.g. look at Japanese).

While the pay-to-play aspect is obviously not utopia, it does seem to work quite well in practice: Arabic does have a large amount of support as is; to get more, you really need to have people who use Arabic as primary members so they can make the hard decisions.


carapace 3207 days ago

"Hey Arabs, we'll computerize your language if you pay for it or show up otherwise we'll do it anyway, poorly, because it's fun for us and it makes us feel like we're helping. Hope that works for you 'cause it's what you're going to get whether you want it or not."

Yeah, I don't have a lot of respect for that.


mjevans 3208 days ago

Is it possible to, based purely on the context of the symbols that the Unicode standard has added, determine the correct way of connecting the characters? Does the written language have such a formal mechanic?

Or are tashkil extra distinctive modifications that might be like spices or sauces per (whatever you want to call a single display slot)?


simonh 3208 days ago

You can't determine that purely from Unicode, you have to also know the conventions used in writing Arabic script. However Unicode is not intended to encode such conventions.

Suppose these conventions change, as they have throughout history? Or if there are different variations of these conventions in different regions or sub-dialects? Also for example in Arabic it's often possible to determine the pronunciation of a word from its context in a sentence, but in other contexts it isn't and so the tashkil are added. There's no way for a system like Unicode to decide that for you. For example suppose you cut-and-paste the word from one sentence into another, should Unicode somehow automatically add or remove the tashkil? No, that's up to the author (e.g. performing the edit in a word processor) or the program performing the operation if it's being done programmatically.
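As an illustration of the "program performing the operation" case, here is a minimal Python sketch of a helper that removes the marks from a word. The name `strip_marks` is hypothetical. It removes every nonspacing mark (general category `Mn`), which is broader than tashkil alone, so a real editor would need to be more selective.

```python
import unicodedata

def strip_marks(text):
    """Drop all nonspacing combining marks (category Mn) from text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# KAF + KASRA, TEH + FATHA, ALEF, BEH: a vocalized spelling.
vocalized = "\u0643\u0650\u062a\u064e\u0627\u0628"
bare = strip_marks(vocalized)

print(len(vocalized), len(bare))
```

Whether to call something like this on paste is an editorial decision made in the application layer; the encoding only makes it possible to tell the base letters and the marks apart.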

Unicode provides one layer in the stack. Fonts provide another layer. Program code or editorial sensibility provides another layer. Many criticisms of Unicode are premised on the expectation that it should be solving problems that belong to another layer. Not all criticisms, it's a complex system that has had to make many compromises and there have been a series of mistakes in its history, but taken overall it's been unbelievably successful and useful.

I'm so in awe of the way it solves such a huge range of problems in the space that people picking nits about the gaps that remain piss me off, especially when they're based on a fundamental misunderstanding of the problem it's actually solving. Cynicism is easy, solving hard problems is not. I know who gets my respect.


mjevans 3208 days ago

It sounds like the written language is more similar to vocal musical instructions; and also like the 'spices' analogy that I was making is how the changes in display and connecting forms work.

With so much complexity and variability present at the time in history that the written form becomes fixed, I can't imagine any solution actually being easy, and can only think of editor software offering an emoji-like list of likely 'accents' for a given 'word' (and breaking it down per character for corrections).

Such a system sounds incredibly tedious for user and programmer alike. I am glad that written 'western' languages became 'fixed' many centuries ago.
