by baxter001 3906 days ago
Are there more basic string operations that need to be available for Arabic? A quick play with copy and paste produces all sorts of unusual effects, I assume because combining marks are either not copied or not suitable in their new contexts.

I'd assume that the text is modified in units more complex than substrings and whitespace splitting?

3 comments

> I'd assume that the text is modified in units more complex than substrings and whitespace splitting?

Correct. Each letter in Arabic can have up to four shapes depending on its position in a word: initial, medial, final and isolated[0].
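To make the four shapes concrete, here is a minimal sketch using Python's standard `unicodedata` module. It relies on the legacy Arabic Presentation Forms-B block, which happens to give each positional shape of BEH its own code point. The helper name `describe_forms` is mine, and normal text stores only the base letter (U+0628) and leaves shaping to the renderer, so treat this as an illustration rather than how text is actually encoded.

```python
import unicodedata

# Legacy presentation-form code points for the four shapes of BEH (U+0628).
# Ordinary Arabic text does not use these; the font/shaper picks the shape.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def describe_forms(forms):
    """Return (code point, Unicode name, NFKC-normalized code point) per form."""
    rows = []
    for ch in forms:
        base = unicodedata.normalize("NFKC", ch)  # folds the shape back to the letter
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), f"U+{ord(base):04X}"))
    return rows

for row in describe_forms(BEH_FORMS):
    print(row)
```

All four entries normalize back to the same letter, which is why a renderer that draws each stored letter in isolation gets the shapes wrong.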

Also note that the sample text on the left contains Arabic diacritics[1]. These are usually omitted by native speakers/writers since they can be inferred from the context (the only exception I can think of is religious scripture), but they certainly add to the complexity of it all.
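The diacritics are separate combining code points that follow their base letter, which is probably what bites you when copying substrings. A rough sketch, assuming Python's `unicodedata` and a made-up helper `strip_marks`; filtering on category `Mn` is a simplification and also removes non-Arabic combining marks:

```python
import unicodedata

def strip_marks(text):
    """Drop nonspacing combining marks (category Mn), e.g. Arabic short-vowel marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Three letters (kaf, teh, beh), each followed by a fatha mark (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_marks(vocalized)

print(len(vocalized), len(plain))   # code point counts with and without marks
print(vocalized[:1] == plain[:1])   # a one-code-point slice silently loses the mark
```

A naive slice by code point separates a letter from its mark, so any string operation that wants to keep them together has to work on larger units than single code points.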

Needless to say, creating a monospaced Arabic font is quite challenging. Kudos to the designer.

[0] https://en.wikipedia.org/wiki/Arabic_alphabet#Letter_forms

[1] https://en.wikipedia.org/wiki/Arabic_diacritics

Letters in Arabic render differently depending on where they lie in the word: at the start, middle, or end, or when following/preceding certain characters.

This is the main problem you sometimes see in movies where they try to show something in Arabic and get the rendering wrong. They probably take each letter on its own and try to build the words that way, so the letters do not join and the whole thing looks like a mess.

Fwiw, Greek also has one letter (sigma) that differs in rendering depending on where in the word it appears. It's the same letter, it's just that when it appears word-final, it looks different. But Unicode decided to split it into two codepoints, rather than treat it as a rendering issue. Therefore rendering of Greek codepoints never depends on position within the word, even though rendering of Greek letters can. Instead it's up to the user to make sure that whenever a lowercase sigma appears word-final, it is encoded with a different codepoint, GREEK SMALL LETTER FINAL SIGMA (U+03C2), 'ς', rather than the usual codepoint, GREEK SMALL LETTER SIGMA (U+03C3), 'σ'.
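You can see the two codepoints from Python. This is only a sketch with the standard library; the last lines assume CPython's `str.lower()`, which as far as I know applies the Unicode final-sigma casing rule, so other runtimes may behave differently.

```python
import unicodedata

FINAL_SIGMA = "\u03C2"  # word-final lowercase sigma
SIGMA = "\u03C3"        # ordinary lowercase sigma

print(unicodedata.name(FINAL_SIGMA))
print(unicodedata.name(SIGMA))

# Both lowercase forms share one capital letter, U+03A3.
print(FINAL_SIGMA.upper() == SIGMA.upper() == "\u03A3")

# An all-caps word with three capital sigmas (written with escapes).
word = "\u039F\u0394\u03A5\u03A3\u03A3\u0395\u03A5\u03A3"
lowered = word.lower()
print(lowered)
```

So the information about position is lost when uppercasing, and lowercasing has to look at context to put it back.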

I'll add to what others mentioned the fact that proper text selection for bidirectional text is awkward at best, and often outright impossible with buggy UIs, particularly at the border where the text changes direction.
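To see where those direction borders fall in a string, here is a rough sketch that groups characters by their strong bidi class using `unicodedata.bidirectional`. The helpers `strong_direction` and `direction_runs` are made up for illustration, and this is not the Unicode Bidirectional Algorithm (UAX #9): it ignores weak and neutral characters, embedding levels, and explicit controls.

```python
import unicodedata
from itertools import groupby

def strong_direction(ch):
    """Map a character to 'LTR', 'RTL', or None for weak/neutral bidi classes."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    return None

def direction_runs(text):
    """Group strongly-typed characters into consecutive same-direction runs."""
    typed = [(strong_direction(ch), ch) for ch in text]
    typed = [(d, ch) for d, ch in typed if d is not None]  # drop spaces, digits, etc.
    return [
        (d, "".join(ch for _, ch in group))
        for d, group in groupby(typed, key=lambda pair: pair[0])
    ]

# Latin text, an Arabic word, then Latin text again.
sample = "abc \u0645\u0631\u062D\u0628\u0627 def"
print(direction_runs(sample))
```

The stored (logical) order stays left-to-right in memory while the Arabic run is displayed right-to-left, so a selection that crosses a run border covers a contiguous range in one order but not in the other.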