by rhythmvs 3106 days ago
Punctuation is a challenge. When typing modern Ethiopic languages (like Amharic) in Ge’ez script, the regular (Western) word space (U+0020) is used as the word boundary. Properly typesetting classical Ge’ez, however, requires the colon-like Ethiopic Wordspace (፡ U+1361). How best to encode and implement this is unclear.

First, I simply substituted all word spaces (and equivalent white space characters) with U+1361. Obviously this breaks things. For one, every text editor (unaware of ፡ as an alternative word divider character) treats the incoming text as one very long string without a single word boundary, and thus without any opportunity to break lines. Since no hyphenation pattern dictionaries exist for classical Ge’ez either, text encoded like this basically cannot be rendered on screen, except as a single, indefinitely overflowing line.
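To make the failure concrete, here is a minimal Python sketch of that first attempt. The function name and the sample phrase are mine, chosen only for illustration. It shows that U+1361 currently has the general category Po (punctuation) rather than Zs (space separator), so white-space-based tooling finds nothing to split on.

```python
import re
import unicodedata

ETHIOPIC_WORDSPACE = "\u1361"  # ፡

def naive_substitute(text: str) -> str:
    """Replace every run of white space with a single U+1361 (hypothetical helper)."""
    return re.sub(r"\s+", ETHIOPIC_WORDSPACE, text.strip())

sample = "ሰላም ለኪ"  # arbitrary sample phrase, for illustration only
encoded = naive_substitute(sample)

# U+1361 is classified as punctuation, not as a space separator.
print(unicodedata.category(ETHIOPIC_WORDSPACE))  # Po

# No white space is left, so a white-space tokenizer sees one single "word".
print(len(encoded.split()))  # 1
```

Anything that looks for break opportunities by white space alone will behave like `split()` here and treat the whole text as one unbreakable run.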

Next, convinced that Unicode’s U+1361 was impractical as a _character_ (at the level of the text encoding), I implemented it as an alternate _glyph_ for the regular word space, using OpenType glyph substitution (thus at the level of the font and text shaping). This worked out beautifully, because now I could cheat typesetting engines and take advantage of common line-breaking algorithms (which not only use word dividers as line-breaking opportunities, but also stretch or shrink them to justify lines). Unfortunately, as word spaces are stretched or shrunk, the OpenType shaping engine that draws the colon-like ፡ Ethiopic word divider is not aware of the available space. The divider therefore ends up badly placed: either too near to the preceding word or, worse, overlapping the preceding character.

Eventually, I went for a hybrid approach, using the sequence U+0020 + U+1361 + U+0020 (i.e. surrounding the fixed-width Ethiopic word divider with regular, flexible white spaces). While this is an ugly hack (certainly from a purist text encoding perspective), in practice it solves the issue, with nicely spaced-out ፡s between words.
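A rough Python sketch of that hybrid encoding, assuming the input is plain text with words separated by ordinary white space. Both helper names are hypothetical, and the decoder is only there to show the hack is reversible.

```python
import re

ETHIOPIC_WORDSPACE = "\u1361"  # ፡

def hybrid_encode(text: str) -> str:
    """Join words with U+0020 + U+1361 + U+0020 (hypothetical helper)."""
    return f" {ETHIOPIC_WORDSPACE} ".join(text.split())

def hybrid_decode(text: str) -> str:
    """Collapse each divider, plus any white space around it, back to one space."""
    return re.sub(r"\s*\u1361\s*", " ", text).strip()

encoded = hybrid_encode("ሰላም ለኪ")  # arbitrary sample phrase
print(encoded)

# The ordinary spaces are back, so white-space-based tools see boundaries again.
print(encoded.split())
```

The trade-off is visible in the last line: the divider now shows up as a token of its own, so any downstream word count or search has to filter it out.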

Another, related issue concerns the lack of hyphens. Since word boundaries in classical Ethiopic are unambiguously marked with the explicitly drawn ፡, there’s no need to indicate when a word is broken between syllables at the end of a line. If a line ends with ፡, the reader knows that the next couple of glyphs on the following line do not belong to the preceding word, but form another one. Otherwise, the syllables following the last ፡ on the line must form a word together with the syllables on the following line, up to the next ፡. But as no current typesetting software supports this locale, one again needs to find a hackish work-around. I did so, at the level of the font, by putting an empty, zero-width glyph at the U+002D codepoint (hyphen-minus)…
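The reading rule above can be illustrated with a toy line breaker in Python. To be clear, this is not the font-level work-around I used, just a sketch of the intended result. It assumes each code point is one syllable (ignoring combining marks), measures width in code points rather than real glyph advances, and uses a made-up function name.

```python
ETHIOPIC_WORDSPACE = "\u1361"  # ፡

def break_lines(words, width):
    """Greedily fill lines, breaking words anywhere without a hyphen (toy sketch).

    The divider is glued to the last syllable of its word, so a line may
    end with a divider but never starts with one.
    """
    units = []
    for i, word in enumerate(words):
        chars = list(word)
        if i < len(words) - 1:
            chars[-1] += ETHIOPIC_WORDSPACE  # keep the divider with its word
        units.extend(chars)

    lines, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > width:
            lines.append(current)
            current = ""
        current += unit
    if current:
        lines.append(current)
    return lines

for line in break_lines(["ሰላም", "ለኪ"], 4):
    print(line)
```

A line that does not end in ፡ simply continues its last word on the next line, which is exactly what the reader is expected to infer.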

There are some more issues involved with typesetting classical Ethiopic Ge’ez, but word dividers, hyphenation and line-breaking are the toughest.

If you’d like to know more, the W3C has an Editor’s Draft on ‘Ethiopic Layout Requirements’ [1], but many of the issues raised there remain unresolved, pending user feedback. On the Unicode.org website I also found an Individual Contribution (for consideration by the Unicode Technical Committee), “Proposal to Reclassify Ethiopic Wordspace as a Space Separator (Zs) Symbol” [2], which is very illustrative and offers thorough suggestions for implementation details.

I’d love to discuss these and other scholarly typesetting issues with anyone interested. Do check out my Dodecaglotta side project [3] and get in touch!

[1] https://w3c.github.io/elreq/

[2] http://unicode.org/L2/L2015/15148-ethiopic-wordspace.pdf

[3] http://dodecaglotta.com/#type-design