by hf 3549 days ago
I simply cannot wrap my head around the direction of the Unicode discourse.

We're discussing the appropriate code-point for different smiley faces, obscure electrical symbols[0] or, in the present case, half stars to express film or book ratings, yet we have no complete set of sub- and superscripts!

Am I mistaken in thinking it odd that there's a complete Klingon alphabet but no representation whatsoever for most Greek or Latin subscripts? Or what if, heaven forbid, I wanted to use a 'b' index/subscript? Tough! Not even the "phonetic extensions", where subscript-i comes from, provide it.

Refer to https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc... or look for SUBSCRIPT in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
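The same check can be run locally instead of grepping UnicodeData.txt. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `subscript_for` is made up for illustration, and the result depends on the Unicode version your Python build ships with.

```python
import string
import unicodedata

def subscript_for(letter):
    """Return the subscript form of a lowercase Latin letter, or None."""
    try:
        # Character names follow the pattern used in UnicodeData.txt.
        return unicodedata.lookup(f"LATIN SUBSCRIPT SMALL LETTER {letter.upper()}")
    except KeyError:
        return None

found = {c: subscript_for(c) for c in string.ascii_lowercase}
have = [c for c, s in found.items() if s is not None]
missing = [c for c, s in found.items() if s is None]

print("have:   ", " ".join(f"{c}->{found[c]}" for c in have))
print("missing:", " ".join(missing))
```

Whatever ends up in the `missing` list is the gap being complained about here.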

Surely there are one or two actual scientists in the Unicode Consortium? Or even one odd soul still sporting a notion of consistency who finds it only logical to provide a "subscript b" if there's a "subscript a"?

How am I wrong?

[0] https://news.ycombinator.com/item?id=11958682

4 comments

Unicode is not known for its consistency in dealing with these issues. The original idea behind Unicode was to be able to represent every then-extant character set with perfect fidelity (i.e., go from X to Unicode and back, and you should get the same data). Why are there letters like U+212B Angstrom sign (not to be confused with U+00C5 Latin capital A with ring above) or things like half-width and full-width characters? Because they were present in Shift-JIS, not because of any coherent notion of what constitutes a glyph. Han unification was driven more by the need to keep from blowing a space budget than by actual rationalization of whether or not the scripts deserved separate spaces.
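To make the round-trip point concrete, here is a small sketch with Python's standard `unicodedata` module. It shows that the two "A with ring" code points are distinct but canonically equivalent, and that a full-width letter only folds to its ASCII counterpart under compatibility normalization. The variable names are mine.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# Distinct code points, so a plain comparison fails...
same_raw = angstrom_sign == a_with_ring
# ...but canonical normalization (NFC) maps the sign to the letter.
same_nfc = unicodedata.normalize("NFC", angstrom_sign) == a_with_ring

fullwidth_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A
# Full-width forms are only compatibility-equivalent, so NFC keeps them
# and NFKC is needed to fold them to ASCII.
folded_nfc = unicodedata.normalize("NFC", fullwidth_a) == "A"
folded_nfkc = unicodedata.normalize("NFKC", fullwidth_a) == "A"

print(same_raw, same_nfc, folded_nfc, folded_nfkc)
```

Normalizing is lossy in exactly the sense described above: once folded, you can no longer convert back to the legacy encoding and recover the original bytes.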

Note that Klingon isn't in Unicode (it was explicitly rejected by the UTC, with a vote of 9 in favor of the rejection proposal, 0 against it, and 1 abstaining). Tengwar and Cirth, though, are actually considered serious proposals for Unicode, just really, really low priority compared to, say, Mayan script (for which the first proposal should be going live in 2017). Mayan script is interesting in its own right because it's the script (well, of the ones I'm aware of) that most challenges normal conventions on what constitutes letters and glyphs.

ISTM a great deal of trouble and complication could have been prevented by three special types of NBSP that meant "sub", "super", and "back to normal". It's true that some glyphs will be special-cased by some fonts, but in general the glyph is just shrunk and translated when sub- or super-scripted.
Yes, just like the LRE/LRO/RLE/RLO/PDF/etc characters.
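For reference, the existing directional controls are ordinary code points with the "format" general category, which is the precedent a sub/super/normal trio would follow. A quick sketch with Python's standard `unicodedata` module lists them:

```python
import unicodedata

# U+202A..U+202E: the explicit embedding/override controls and their terminator.
controls = []
for cp in range(0x202A, 0x202F):
    ch = chr(cp)
    controls.append(
        (f"U+{cp:04X}", unicodedata.bidirectional(ch),
         unicodedata.category(ch), unicodedata.name(ch))
    )

for row in controls:
    print(*row)
```

Note that PDF closes whichever embedding or override was opened last, so the scheme is already stack-based.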
The Klingon alphabet was proposed but rejected.

Subscript letters were proposed as well: http://www.unicode.org/L2/L2011/11208-n4068.pdf but apparently "Not accepted: Because this has been controversial and is not directly related to repertoire under ballot, it is not appropriate to add it to Amd1 but may be considered for a future amendment" http://www.unicode.org/L2/L2012/12130-n4239.pdf

Looks like here's a recent draft for a new proposal: https://github.com/stevengj/subsuper-proposal

For those looking for Klingon, and many more fictitious scripts, there is the "ConScript Unicode Registry" [0], which assigns them code points in the BMP Private Use Area [1].

[0] https://en.wikipedia.org/wiki/ConScript_Unicode_Registry [1] https://en.wikipedia.org/wiki/Universal_Character_Set_charac...
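A small sketch of what "private use" means in practice, using Python's standard `unicodedata` module. The Klingon range U+F8D0 to U+F8FF is my recollection of the ConScript assignment, so treat it as an assumption; the code only checks properties that the Unicode database itself defines.

```python
import unicodedata

# Assumed ConScript range for Klingon (pIqaD); not part of the Unicode Standard.
KLINGON_FIRST, KLINGON_LAST = 0xF8D0, 0xF8FF

categories = {unicodedata.category(chr(cp))
              for cp in range(KLINGON_FIRST, KLINGON_LAST + 1)}
# Private-use code points carry no standard name, so ask for a fallback.
name = unicodedata.name(chr(KLINGON_FIRST), "<no standard name>")

print(categories, name)
```

Every code point in the range reports the private-use category, so rendering depends entirely on both sides agreeing on the registry and having a matching font.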

Super/sub scripts are markup, not characters. There shouldn't be any in Unicode.
I beg to disagree. In science subscripts are part of the symbols, just like diacritics.

Superscripts, on the other hand, are part of math notation, like fractions and square roots.

I disagree. In math there can be super-super-superscripts, as with tetration representations https://en.wikipedia.org/wiki/Tetration . Does each get its own character, and when does it end?

In science, consider an isotope like

   180m
       Ta
    73
This cannot be represented as a sequence of symbols because that would give:

      180m           180m
          Ta   -or-        Ta
    73                   73
Markup is how Wikipedia represents it correctly, as:

    <span style="display:inline-block;margin-bottom:-0.3em;
    vertical-align:-0.4em;line-height:1.0em;font-size:80%;text-align:right">180m<br>
    73</span>
How would you do it without markup?

In addition, pretty much anything can go in superscripts, including 2^א and integral equations. The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.
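A rough sketch of that start/end marker idea, in Python. The three marker characters are hypothetical: I borrow private-use code points purely for illustration, since no such controls exist in Unicode. The renderer maps them onto nested `<sup>`/`<sub>` tags and does no HTML escaping.

```python
# Hypothetical control characters (private-use code points, illustration only).
START_SUP = "\ue000"
START_SUB = "\ue001"
END_SCRIPT = "\ue002"

def to_html(text):
    """Render the hypothetical markers as nested <sup>/<sub> tags."""
    out = []
    open_tags = []  # stack of tags still waiting for their END_SCRIPT
    for ch in text:
        if ch == START_SUP:
            open_tags.append("sup")
            out.append("<sup>")
        elif ch == START_SUB:
            open_tags.append("sub")
            out.append("<sub>")
        elif ch == END_SCRIPT:
            if not open_tags:
                raise ValueError("END_SCRIPT without a matching start marker")
            out.append(f"</{open_tags.pop()}>")
        else:
            out.append(ch)
    if open_tags:
        raise ValueError("unterminated script run")
    return "".join(out)

# Nesting handles a power tower...
tower = "2" + START_SUP + "2" + START_SUP + "2" + END_SCRIPT + END_SCRIPT
print(to_html(tower))

# ...but the isotope still comes out as two scripts placed one after the other.
isotope = START_SUP + "180m" + END_SCRIPT + START_SUB + "73" + END_SCRIPT + "Ta"
print(to_html(isotope))
```

The second output is a superscript followed by a subscript, which a renderer lays out side by side rather than stacked and right-aligned against "Ta".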

> The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.

Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?

> Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?

Well, the problem is that the subscript and superscript are both aligned with the following regular text, so for the isotope representation you really need a "start right-aligned zero-width superscript" marker and a "start right-aligned zero-width subscript" marker (though zero-width isn't exactly right, since they should have width; it's just that only the wider of the super- and subscript in a pair should be used in spacing the text). There might be other notation that also needs left-aligned versions of these. On top of that you need generic start/end superscript markers that have normal width flow, plus appropriate end markers.
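The spacing rule for the paired case can be sketched in a few lines of Python, using monospace text as a stand-in for real glyph metrics. The function name `stack_prescripts` is made up for illustration.

```python
def stack_prescripts(sup, sub, base):
    """Lay out a right-aligned super/subscript pair in front of a base symbol.

    Only the wider of the two scripts contributes to the horizontal advance.
    """
    width = max(len(sup), len(sub))
    return [
        sup.rjust(width),      # superscript, flush against the base
        " " * width + base,    # base symbol starts after the shared width
        sub.rjust(width),      # subscript, flush against the base
    ]

lines = stack_prescripts("180m", "73", "Ta")
for line in lines:
    print(line)
```

A real text layout engine would need the same information: which two runs form a pair, and which side they align to.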

It's not surprising that an offhand suggestion doesn't magically solve all problems, but I appreciate your taking the time to explain carefully what's missing. Thanks!