
by abdullahkhalids 2 days ago

This problem is not limited to Arabic. Variants of the Arabic alphabet are used by Persian (including Iranian and Dari dialects), Mazanderani, Qashqai, Luri, Gilaki, Kurdish (excluding Kurds in Turkey), Talysh, Azerbaijani (in Iran), Pamir languages, Pashto, Urdu, Balochi, Sindhi (in Pakistan), Punjabi (in Pakistan), Uzbek (in Afghanistan), Turkmen (in Afghanistan), Saraiki, Hindko, Brahui, and languages spoken in Kashmir.

Whole languages are dying out because people are unable to express them properly on computers. Even the popular software that dominates among these speakers does not care to improve their experience. For example, Urdu has traditionally been written in the Nastaliq form [1], but it is usually rendered everywhere in the Naskh form [2]. There is no way to change this, for example, in Android without basically rooting it and changing the system fonts.

[1] https://en.wikipedia.org/wiki/Nastaliq

[2] https://en.wikipedia.org/wiki/Naskh_(script)

4 comments

helterskelter 2 days ago

> There is no way to change this, for example, in Android without basically rooting it and changing the system fonts.

I am really surprised Android won't let the user select their own system font. This is a huge accessibility problem, especially for dyslexics.

Gander5739 1 day ago

You can do it on some vendors' versions, sometimes requiring third party apps like zfont.

Conscat 2 days ago

I feel like I've never gotten a compelling explanation for why Nastaliq is hard/unavailable. I'm not an expert on abjads, but it doesn't look harder to render than Naskh (and it self-evidently is possible since the fonts exist). Does anyone here know why they make it difficult? Urdu is much less obscure than, say, Sharada or other languages with Unicode support. I think Punjabi is also often written in Nastaliq when it's not in Gurmukhi or Roman.

bradrn 2 days ago

In Naskh, each letter has only four forms (for the most part — there are a few ligatures etc. but I think ‘only four forms’ remains basically true). The choice between forms is determined almost entirely by position within a word (initial/medial/final/isolated). All the letters are aligned along the baseline and connect to each other in basically the same way.
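The positional rule described above can be sketched in a few lines of Python. This is a simplified illustration, not how any real shaping engine is implemented: the function name `positional_forms` and the small `RIGHT_JOINING` table are assumptions made for the example, and the table covers only the basic Arabic letters that never connect to the letter that follows them.

```python
# Basic Arabic letters that join only to the preceding letter, never to the
# following one (alef, dal, thal, reh, zain, waw). Simplified assumption:
# real engines take joining types from Unicode data and cover far more
# characters than this.
RIGHT_JOINING = set("\u0627\u062f\u0630\u0631\u0632\u0648")


def positional_forms(word: str) -> list[str]:
    """Return the positional form name for each letter of `word`.

    Assumes `word` contains only Arabic letters (no diacritics, spaces or
    other non-joining characters).
    """
    forms = []
    for i, letter in enumerate(word):
        # A letter connects backwards if the previous letter can join forwards.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # A letter connects forwards if it is not the last one and is not
        # itself a right-joining letter.
        joins_next = i < len(word) - 1 and letter not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh ("kitab", book)
print(positional_forms("\u0643\u062a\u0627\u0628"))
```

For this word the sketch yields initial, medial, final, isolated: the alef does not connect forwards, so the last letter stands alone even though it is at the end of a word.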

By contrast Nastaliq is a much more complicated style. Many letters and letter combinations take on several different forms depending on which other letters surround them. Letter joins are usually diagonal, so letters earlier in a word need to be shifted above the baseline by a variable amount. Having to shift letters vertically as well as horizontally greatly complicates other aspects of the style too.
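To make the vertical shifting concrete, here is a toy Python sketch. It is similar in spirit to anchor-based cursive attachment, but everything in it is an assumption for illustration: the `Glyph` class, the `vertical_offsets` function, and the anchor heights are hypothetical and do not come from any real font.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    entry_y: int  # height of the point where the previous letter attaches
    exit_y: int   # height of the point where the next letter attaches


def vertical_offsets(run: list[Glyph]) -> list[int]:
    """Return a y offset per glyph for one connected run, in reading order.

    The last glyph stays on the baseline; each earlier glyph is moved so that
    its exit point meets the entry point of the glyph that follows it.
    """
    offsets = [0] * len(run)
    # Walk backwards from the last glyph, accumulating the shift.
    for i in range(len(run) - 2, -1, -1):
        following = run[i + 1]
        offsets[i] = offsets[i + 1] + following.entry_y - run[i].exit_y
    return offsets


# Hypothetical anchor heights in font units.
run = [Glyph("first", 0, 0), Glyph("second", 200, 0), Glyph("third", 150, 0)]
print(vertical_offsets(run))  # [350, 150, 0]
```

The offset of the first letter depends on every letter after it, which is one reason a simple per-letter lookup of four forms is not enough for this style.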

(I recall seeing a nice table some time ago showing all the various different possibilities for letter joins in Nastaliq. Unfortunately I can’t seem to find it again. Still, you might get some idea by consulting the documentation of one of the existing Nastaliq fonts, e.g. Awami Nastaliq: https://software.sil.org/awami/what-is-special/)

linmer 1 day ago

Yeah, but the difficulty isn't in rendering the fonts, it's for the font creator. Once the font is ready with all the combinations, rendering and using a Nastaliq font is no different from rendering a Naskh one. Nastaliq fonts are available for Persian (not sure if that's true for other languages); it's just more complexity in making the font. To use a finished font, the only thing needed is permission to change the font.

bradrn 1 day ago

Yep, that’s what I meant; thanks for clarifying the point.

(Though that said, a sibling post linked this interesting talk on limitations in OpenType itself: https://www.tiro.com/John/TypeCon2014_Hudson_DECK.pdf)

ablob 1 day ago

afaik this is a non-issue with modern text rendering engines. Modern font files include rulesets to determine the forms and shaping engines apply these rules to eventually reach the desired "shape" (i.e. order, position and which glyphs to render). For example, if you use HarfBuzz it should be able to calculate the Glyphs and offsets you need for a properly set script.

I personally spent way too much time trying to understand it, but at least according to this video (https://www.youtube.com/watch?v=VaA0v0V4RsU) it really is not that difficult if you leave out all the font-selection and emoji shenanigans.

I think at least FreeType (glyph rendering) and HarfBuzz (text shaping) make it needlessly complex through their documentation. It is extensive in describing what the parts do, but the only way to figure out what you need is by fiddling around. As soon as you want to do more complex stuff you're on your own. Especially figuring out which parts you don't need is annoying.

yorwba 1 day ago

SIL's Nastaliq font uses their own Graphite engine, which is included in Firefox but not other browsers (Demo page: https://graphite.sil.org/graphite_fontdemo ), but e.g. Noto Nastaliq Urdu also exists https://fonts.google.com/noto/specimen/Noto+Nastaliq+Urdu?pr... and does a decent job in non-Graphite engines, certainly better than Awami Nastaliq without Graphite.

So the real question is why Android doesn't make it easy to put Noto Nastaliq Urdu in the font stack.

ValdikSS 1 day ago

>compelling explanation for why Nastaliq is hard/unavailable

https://www.tiro.com/John/TypeCon2014_Hudson_DECK.pdf

abdullahkhalids 1 day ago

Nastaliq fonts already exist and are used whenever possible. Yes, rendering a Nastaliq font takes marginally more compute. But in a world of Electron apps which have a 10-100 times slower UX than they could be, saving compute is not an argument.

abdullahkhalids 1 day ago

There are many high quality Nastaliq fonts available. You can install them on your computer and use them easily in whatever software (for example, office apps) allows you to set the font.

There are no technical reasons preventing the use of Nastaliq fonts everywhere. Only product design decisions by big tech.

mchaver 2 days ago

My guess would be line height is a challenge and Naskh already exists. Then probably because these scripts are not used often in the places that are centers of software/OS development.

smitty1e 2 days ago

This seems an esoteric problem for the outsider.

But consider how cursive is dying out in (at least American) English, and how many centuries of writing will become unintelligible to the casual reader as a result.

All of these important cultural artifacts require maintenance.

RetroTechie 1 day ago

> All of these important cultural artifacts require maintenance.

This. Arabic users can complain about e.g. Unicode not covering their writing in a suitable manner. And I (as a non-Arabic) can certainly see the problems described in the article.

But -going back to earlier days of computing- what stopped Arabic countries from devising a system that does that better than Unicode? (and covers other written languages like Hangul, Japanese or traditional Chinese, better than Unicode covers them)

Seems like that didn't happen? Either too few Arabic people cared, or solution(s) they came up with had shortcomings of their own & weren't implemented widely enough, or Unicode was good enough that few Arabic developers cared to go beyond that.

abdullahkhalids 1 day ago

It's likely the same problem as in Pakistan. Due to the history of colonialism/control by European powers, in these countries personal economic success is usually tied to command of English or French. So even within each of these countries, the rich, educated and those in power prefer Latin script. Consequently, there was never any strong push to develop computing technology for local languages.

The other reason is that it's not technologically simple to solve all the issues highlighted in TFA. Unicode actually does a pretty decent job of setting a uniform standard, but a lot of software has to be written on top of it to get the entire system working: (1) your software must support bidi text, (2) good fonts must be available to display the text in multiple languages, (3) textual data needs to be properly stored in Unicode and transmitted as is at every point in the OS, (4) search engines must deal with the complications of non-breaking spaces and legacy Unicode characters.
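As a rough illustration of point (4), here is a Python sketch of a search key that folds several look-alike spellings together before comparing. The function name `search_key` and the choice of what to fold are assumptions for the example, not an established standard; a real search engine would need language-specific rules (for instance, removing the zero-width non-joiner is a lossy choice for Persian).

```python
import unicodedata

# Characters folded or dropped after normalization. This table is an
# illustrative assumption, not a complete or authoritative list.
FOLD = {
    0x06A9: 0x0643,  # Persian/Urdu keheh -> Arabic kaf
    0x06CC: 0x064A,  # Farsi yeh -> Arabic yeh
    0x0640: None,    # tatweel (kashida), purely decorative elongation
    0x200C: None,    # zero-width non-joiner
    0x200D: None,    # zero-width joiner
}


def search_key(text: str) -> str:
    """Build a loose comparison key for Arabic-script text."""
    # NFKC maps legacy presentation forms to ordinary letters and turns
    # no-break spaces into plain spaces.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(FOLD)
    # Drop combining marks such as short-vowel signs.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace.
    return " ".join(text.split())


persian_spelling = "\u06a9\u062a\u0627\u0628"       # keheh, teh, alef, beh
arabic_spelling = "\u0643\u0650\u062a\u0627\u0628"  # kaf, kasra, teh, alef, beh
print(search_key(persian_spelling) == search_key(arabic_spelling))  # True
```

The key is only meant for matching; the original text should still be stored and displayed unchanged.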

You have to kind of rewrite the entire stack from top to bottom. Preferential Arabic/Persian/Urdu speakers never had the technical skills and the political power to drive those changes in software largely written in different continents.

cenamus 2 days ago

This has pretty much already happened for the older style of German cursive, called Kurrent. Partly also because the Nazis got rid of it.

https://en.wikipedia.org/wiki/Kurrent

Tons of old documents written in it, basically impossible to decipher for anyone that only learned to write "modern" cursive or even print.

mohamedkoubaa 2 days ago

I don't know why people look down their noses at Arabizi

abdullahkhalids 2 days ago

Because people don't want to abandon hundreds or thousands of years of culture for a completely solvable problem.

vessenes 2 days ago

I don’t know either, but I am aware that in glyph based languages (and this article makes the case that Arabic has some glyph-like features), there is considerable social discussion about the equivalents, like pinyin. Detractors worry that sound-based (where sounds are based on the latin / western orthography) approaches to writing change something fundamental in people’s brains as distinct from more native versions.

In Chinese for instance, you can use a keyboard that combines radicals - parts of a character, or you can use a keyboard that combines phonemes. Those seem likely to change literally how you think in your language. There may be related concerns for Arabic.

That said, one of the complaints in the blog is that two different codepoints render to the same exact letter / phrase / word — this is not a problem unique to Arabic in Unicode, and there are known approaches: I’d expect (I’m not a Unicode expert by any means) that more work on the tech stack for rectification (I’m sure there’s a technical Unicode word for this process of matching codepoints for e.g. search and uniqueness of rendering) would likely be useful for Arabic, and relatively seamlessly flow in many places.

e28eta 2 days ago

> I’m sure there’s a technical Unicode word for this process of matching codepoints for e.g. search and uniqueness of rendering

That’d be Unicode Normalization. I don’t have an opinion on the best source for more details, so here’s a link from unicode.org https://www.unicode.org/reports/tr15/
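A small Python sketch using the standard `unicodedata` module shows what normalization does and does not unify for Arabic-script text. The helper name `same_after` is made up for this example; the code points are standard Unicode characters.

```python
import unicodedata


def same_after(form: str, a: str, b: str) -> bool:
    """Return True if `a` and `b` are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Legacy presentation form (lam-alef ligature, isolated form) versus the
# ordinary letters lam + alef: unified by NFKC, but not by NFC.
print(same_after("NFKC", "\ufefb", "\u0644\u0627"))  # True
print(same_after("NFC", "\ufefb", "\u0644\u0627"))   # False

# Yeh + combining hamza above versus precomposed yeh-with-hamza-above:
# canonically equivalent, so NFC already unifies them.
print(same_after("NFC", "\u064a\u0654", "\u0626"))   # True

# Arabic kaf versus Persian keheh: these can look alike in some positions
# and fonts, but they are distinct letters and stay distinct after NFKC.
print(same_after("NFKC", "\u0643", "\u06a9"))        # False
```

The last case suggests that normalization alone does not cover every look-alike pair, so search over mixed Arabic, Persian and Urdu text may need additional folding on top of it.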

I don’t know enough to know whether or not there are still Arabic-specific issues, either in the spec or the implementations.

The example in the article of copy/paste/search is interesting. I think it’s equally likely to be a RtL issue as a normalization bug, but I haven’t done anything significant with either topic.

mohamedkoubaa 1 day ago

I would push back on Arabic being glyph based; it's a phonemic script rendered beautifully on paper. A modified Latin script could faithfully reproduce it semantically, except that readers won't be able to stop and smell the roses. Arguably, once someone is accustomed to reading a script, they don't think or care about the aesthetics much, and if they did, that's a bad property for information density anyways.

linmer 1 day ago

I looked at Arabizi and the numbers are really annoying for formal text etc. Finglish is better in my opinion; however, it causes problems like the same text being readable in two ways. Take "dar": the 'a' can be like the 'a' in 'dad' or like the 'a' in 'car', and depending on the pronunciation it means 'door' or 'gallows'. This can be very annoying in Arabic languages that, unlike Persian, write _ُ_ِ_ٌ_ً_ٍ_ّ_. Instead of numbers, Finglish uses combinations like 'kh' for 'خ' and 'gh' for both 'ق' and 'غ'. Some methods use 'aa' for the 'a' sound in 'bar' and a single 'a' for the 'a' sound in 'lad'.

mchaver 2 days ago

Probably because it's a work around and not what most people want to do. Imagine someone telling you you have to type English in Cyrillic. I know if I could no longer type out Chinese characters and had to use pinyin it would feel very odd and like something was taken away.

pseingatl 2 days ago

For a while, Arabizi was wildly popular and universally used on feature phones. When mobiles became smarter, it was used less. Japanese has romaji and Mandarin has pinyin. Arabic's Arabizi would increase literacy rates and solve all these digital problems.

avadodin 2 days ago

Romanization is a separate issue to using fixed glyphs.

There was a theory in the XIX / early XX century that full literacy was impossible without the Latin script but such claims are ridiculous especially for Arabic which is an alphabetic script already. China has higher literacy rates than Vietnam, for example.

I don't think the many composition rules of Unicode are really necessary, though. Maybe as an extension for academic work or artistic compositions but not for computing.

If all we had were movable types, all of these language users would find a way to write their language that wouldn't require a Turing-complete computer on each glyph. Now the Unicode gods pander to some of these computer-hostile scripts making the users of different scripts feel slighted.

cyphar 2 days ago

The vast majority of Japanese and Mandarin speakers are also not in favour of replacing their current writing systems (which give them a link to thousands of years of their own history) in favour of simplified systems. I suspect it is the same for Arabic speakers.

numpad0 2 days ago

Romaji/pinyin are widely used for typing the actual written scripts. They're not seen as alternate written scripts outside of edge case scenarios (like chats in FPS).

throwaway27448 2 days ago

I generally agree with what you're saying, but there is rather famously a simplified form of Chinese that was designed specifically to increase literacy rates.

cyphar 1 day ago

Japanese also underwent simplification post-WW2, but there is important context here.

In both cases, the original plan was for Chinese characters to fall out of use entirely via gradual simplifications, but in both cases the simplifications stopped soon after the first planned stages and it seems very unlikely it would be a popular initiative at this stage. Basically what happened in both cases was the equivalent of a spelling reform, not the elimination of a writing system.

In the case of Japanese, it seems there is some regret around simplification because characters not in the 常用漢字表 do not have the component simplifications applied that the standardised characters do (for instance, 攪拌 is more commonly used than 撹拌 despite the 覚 component being the "modern" simplified form -- and there are characters with no simplified form like the first character in 艱難, first character in 辻褄, or 迄).

mxchelsemaan 1 day ago

It's aesthetically revolting and allows for multiple renderings of the same word.