|
|
|
|
|
by Manishearth
3385 days ago
|
|
> Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere? Yes > And what are you supposed to do when you encounter one? Nothing. Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly use a markup language. ---------- Unicode is ultimately a system for describing text. Not all stored text is intended to be rendered. This is why it has things like lacuna characters and other things. So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS respectively. It lets you preserve the nature of the text without losing info. (Ruby is inside unicode instead of being completely deferred to markup since it is used often enough in Japanese text, especially whenever an irregular (not out of the "common" list) kanji is used. You're supposed to use markup if you actually want it rendered, but if you just wanted to store the text of a manuscript you can use ruby annotations) |
|
Note that I didn't actually ask you about rendering.
I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:
- Do they count as graphemes?
- What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?
- Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?
Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".