| Step One: Admit there's a problem.

I heard, "Tell me more about what you think would be better." Here goes:

For written languages that are well-served by a simple sequence of symbols (English, etc.) there is no problem: a catalog of the mappings from numbers to pictures is all that is required. Put them in a sequence (anoint UTF-8 as the One True Encoding) and you're good to go.

For languages that are NOT well-served by this simple abstraction, the first thing to do (assuming you have the requisite breadth and depth of linguistic knowledge) is to figure out simple formal systems that do abstract the languages in question. Then determine equivalence classes and standardize the formal systems. Let the structure of the language abstraction be a "first-class" entity that has reference implementations. Instead of adding weird modifiers and other dynamic behavior to the code, let them be actual simple DSLs whose output is the proper graphics.

Human languages are like a superset of what computers can represent. Here's the Unicode Standard[1] on Arabic:

> The basic set of Arabic letters is well defined. Each letter receives only
one Unicode character value in the basic Arabic
block, no matter how many different contextual appearances it may exhibit in text. Each
Arabic letter in the Unicode Standard may
be said to represent the inherent semantic identity of the letter. A word is spelled as a
sequence of these letters. The representative
glyph shown in the Unicode character chart
for an Arabic letter is usually the form of the letter when standing by itself. It is simply used
to distinguish and identify the character in the code charts and does not restrict the glyphs
used to represent it.

They baldly admit that Unicode is not good for drawing Arabic. I find the phrase "the inherent semantic identity of the letter" to be particularly rich. It's nearly mysticism.

If it is inconvenient to try to represent a language in terms of a sequence of symbols, then let's represent it as a (simple) program that renders the language correctly, which allows us to shoehorn non-linear behavior into a sequence of symbols. If you think about it, this is already what Unicode is doing with modifiers and such. If you read further in the Unicode Standard doc I quoted above you'll see that they basically do create a kind of DSL for dealing with Arabic.

I'm saying: make it explicit. Don't try to pretend that Unicode is one big standard for human languages. Admit that the "space" of writing systems is way bigger and more involved than Latin et al. Study the problem of representing writing in a computer as a first-class issue. Publish reference implementations of code that can handle each kind of writing system along with the catalog of numbered pictures.

From the Unicode Standard again:

> The Arabic script is cursive, even in its printed form. As a result, the same
letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.

Computer systems that are adapted to English are not going to work for Arabic. I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.

Consider the "Base-4 fractions in Telugu": https://blog.plover.com/math/telugu.html

The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is great! But any software that wants to use them properly will require some code to translate between numbers in the computer and Telugu sequences of those graphics. Let that be part of "Unicode" and I'll shut up.

In the meantime, I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.

To sum up: I think the thing that replaces Unicode for dealing with human languages in digital form should:

A.) Be created by linguists with help from computer folks, not by computer folks with some nagging from linguists (apologies to the linguist/computer folk who actually did the stuff.)

B.) Clearly state the problems first: What are the ways that human languages are written down?

C.) Write specific DSLs for each kind of writing. Publish reference implementations.

I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)

[1] http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf |
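To make the Telugu point concrete, here is a minimal Python sketch of the kind of translation code the post is asking for. The names `encode_quarters` and `decode_quarters` are hypothetical, and the digit assignment is an assumption read off the Unicode character names: U+0C78 to U+0C7B are named as fraction digits for odd powers of four, U+0C7C to U+0C7E for even powers, and U+0C66 TELUGU DIGIT ZERO is used here as the even-position zero, matching the glyph list quoted in the post. Real scribal conventions may well differ.

```python
from fractions import Fraction

# Assumption: digit tables derived from the Unicode character names only.
ODD_DIGITS = "\u0c78\u0c79\u0c7a\u0c7b"   # values 0..3 at 1/4, 1/64, ...
EVEN_DIGITS = "\u0c66\u0c7c\u0c7d\u0c7e"  # values 0..3 at 1/16, 1/256, ...


def encode_quarters(value: Fraction, places: int) -> str:
    """Hypothetical helper: write 0 <= value < 1 as `places` base-4 digits."""
    if not 0 <= value < 1:
        raise ValueError("value must be in [0, 1)")
    scaled = value * 4 ** places
    if scaled.denominator != 1:
        raise ValueError("value is not exactly representable in that many places")
    n = scaled.numerator
    digits = []
    # Peel off base-4 digits from the least significant position upward.
    for position in range(places, 0, -1):
        n, d = divmod(n, 4)
        table = ODD_DIGITS if position % 2 else EVEN_DIGITS
        digits.append(table[d])
    return "".join(reversed(digits))


def decode_quarters(text: str) -> Fraction:
    """Hypothetical inverse of encode_quarters."""
    total = Fraction(0)
    for position, ch in enumerate(text, start=1):
        table = ODD_DIGITS if position % 2 else EVEN_DIGITS
        d = table.find(ch)
        if d < 0:
            raise ValueError(f"unexpected sign at position {position}")
        total += Fraction(d, 4 ** position)
    return total
```

Nothing in a plain code point sequence says that the first sign counts quarters and the second sixteenths; that knowledge lives in code like this, outside the character catalog.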
> They baldly admit that Unicode is not good for drawing Arabic. [...] I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.
Unicode isn't good for drawing anything. Unicode is not intended to encode how a text should be displayed, and it doesn't try to. At all, not even slightly. This is the root of my disagreement with your post. You're claiming it can't accurately render the appearance of text, but that simply isn't its purpose. It is purely and only about encoding the graphemes. Glyphs are what fonts and display technologies like PostScript are for, not Unicode.
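A small Python illustration of the grapheme/glyph split, offered as a sketch rather than a description of how any renderer works: Unicode does carry legacy "presentation form" code points for the isolated, final, initial and medial shapes of the Arabic letter beh, and compatibility normalization folds each of them back to the single letter U+0628. The helper name `fold_to_base` is made up for this example.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the one code point used in ordinary text

# Legacy compatibility code points for the four contextual shapes of beh.
PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def fold_to_base(ch: str) -> str:
    """Hypothetical helper: map a presentation form back to its base letter."""
    return unicodedata.normalize("NFKC", ch)


for form in PRESENTATION_FORMS:
    # Print the shape's name alongside the name of the letter it folds to.
    print(unicodedata.name(form), "->", unicodedata.name(fold_to_base(form)))
```

In normal text only U+0628 is stored, and picking which of the four shapes to draw is left to the font and the shaping engine.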
You could argue that it should do that, that perhaps Unicode should be a vector drawing language or something, but it's hard to see how that would make it useful for text processing that does concern itself with graphemes and grapheme-like units. Unless the display-oriented system you want contained within it a grapheme encoding system like Unicode to facilitate that. But then why not work the other way around: use Unicode for the encoding and build a display system on top of it to address your concerns?
I think trying to have your cake and eat it with a family of distinct DSLs would be problematic. Text processing is bad enough, but how would you process the content of a string that is actually a DSL? With Unicode it's possible to write a library that can process text in any script, even ones not in the standard yet. But if text could consist of computer code in any one of thousands of different domain-specific languages, how would you ever be able to write one piece of code to work with all of them and all possible future permutations? Finally, if your DSL is producing display output, how does that work with fonts? If you want to vary the appearance of the output, how do you apply that to what the DSL emits? It just seems that this approach produces an enormous, monolithic, super-complex rabbit hole with no bottom in sight.
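As a rough sketch of what "one library for any script" looks like in practice, the Python function below strips combining marks using only the character properties from the standard `unicodedata` module. The name `strip_marks` is invented for this example, and dropping every nonspacing mark is a deliberately crude policy; the point is only that the same few lines handle a Latin accent and Arabic tashkil without knowing which script they are looking at.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: drop nonspacing combining marks from any script."""
    # Decompose first so precomposed letters like U+00E9 split into base + mark.
    decomposed = unicodedata.normalize("NFD", text)
    # "Mn" is the general category for nonspacing marks (accents, tashkil, ...).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


latin = "caf\u00e9"
arabic = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf, teh, beh, each with a fatha

print(strip_marks(latin))
print(strip_marks(arabic))
```

The function never branches on script: it relies on per-character properties that the standard assigns uniformly, which is the kind of generic processing a per-script DSL would make hard.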