Hacker News new | ask | show | jobs
by KerrAvon 1102 days ago
It's not actually that different or complicated if you're already doing proportional plaintext rendering -- you need to store style attributes for a range of text, and you need to support different heights per line.

The real complexity is rendering all of unicode properly, and supporting international fonts, bidi layout, vertical text, etc.

3 comments

You don't need any fancy data structures. 95% of the performance goes into glyph rendering. And with Unicode the performance gain from monospace fonts goes out the window as some Unicode characters are very large, and not only that they also take up many bytes. So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters. You also need to read ahead as there are combination characters, for example a smiley combined with the color brow becomes a brown smiley. I've blogged about implement support for Unicode in an editor here: https://xn--zta-qla.com//en/blog/editor10.htm
> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.

5 bytes? In what encoding?

Some "single character" emoji easily exceed 5 bytes in all encodings. You may think ZWJ sequences are cheating, but emoji isn't the only language encoded in Unicode with complex ZWJ sequences.
My question is about the phrase up to 5. What in Unicode is up to 5? Codepoints are up to 4 in all the encodings I know. ZWJ sequences may as well be arbitrarily long. What is "up to 5"?
The original quote for reference:

> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.

FWIW, I didn't read that as suggesting an upper bound of 5 bytes, but rather as an example using arbitrary numbers: N bytes of code units could, depending on the font providing the glyph(s) for the respective grapheme(s), could be rendered at M times the size of, say, the letter A, where N != M -- despite the font otherwise being monospaced. Which is just another way of saying that you must consult the font for the character widths involved.

I think you're reading that quote as an assertion that:

    For any grapheme G, G can be encoded in at most 5 bytes.
While what I think was being said was:

    There exists a grapheme G, where G is encoded in 5 bytes, and the respective glyph happens to be displayed at 3 times a single character (e.g. the letter A), despite the font otherwise being monospaced. Therefore you *must* consult the font for each glyph to correctly determine character widths.
Continue to the next sentence of context:

> You also need to read ahead as there are combination characters, for example a smiley combined with the color brow becomes a brown smiley.

Emphasis mine. Clearly combination characters are being treated separately.

Frankly I think it's crazy to read "up to 5 bytes" and not think that it suggest an upper bound. I think you're reaching for a highly questionably interpretation of a totally unambiguous clause. If the author meant to express what you're saying, they would certainly have written: "Some Unicode characters are 5 bytes long and take up the same canvas space as 3 characters". Which would still look incorrect if they followed it with the sentence "You also need to read ahead as there are combination characters...".

It is far more likely that the author is simply mistaken and should have said 4 bytes, and perhaps used the word "codepoint" instead of "character" in the original sentence. That's a perfectly understandable technical error, while the reinterpretation you're putting together would imply an error of colloquial language.

Codepoints themselves could technically all fit into 3 bytes (or 21 bits to be precise), but there is no standard 3-byte encoding. The highest Unicode codepoint is 0x10FFFF.

An idea for a variable width encoding of 1 to 3 bytes: Read the MSB of each byte: If it's 0, don't read any more bytes. If it's 1, read the next byte. Do the same (up to 3 times). The non MSB bits of each byte then make up the codepoint.

    0xxxxxxx                    (ASCII)
    1xxxxxxx 0xxxxxxx           (0x0080 - 0x3FFF)
    1xxxxxxx 1xxxxxxx 0xxxxxxx  (0x4000 - 0x1FFFFF)
If the Unicode range grew in future to require further bits you could use the same technique by allowing greater than 3 bytes.

    1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0x10000000)
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases.

Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.

It also has the issue that you can represent some codepoints with more than one encoding: eg, put ASCII characters into 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.

As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.

You could also use a fixed-width encoding of 24-bits to avoid the problem of reading it serially, but as computers typically work in powers of 2, you would align 24-bit values at 32-bit addresses and load them into 32-bit+ registers, so there's nothing to really gain in terms of performance here over UTF-32, but you could save a bit of space.

Isn't this just utf8 without information on how many of the following bytes belong to the glyph?
I read this as colloquial English for "around" or "approximately". Not setting a bound limit, but setting an example size.
But you also read it as referring to ZWJ sequences? So the author has picked a number that is actually below average and they've worded it as up to...?

Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".

If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".

There was an emoji example of length seven posted to HN recently:

<https://news.ycombinator.com/item?id=36159443>

Emojis with skin color, mostly
No, that's a ZWJ sequence. Those can be arbitrarily long. Doesn't explain where "5 bytes" comes from.
Is there a maximum number of "zero-width joiner" (ZWJ) sequences that can be combined to create a single-width emoji? (Yes, I know the term "single-width" is a loaded term.) I cannot find a precise answer.
> 5 bytes? In what encoding?

I believe UTF-8 reserved up to six bytes for a single character.

Yes, UTF-8 was up to 6 bytes per character early on. Some broken implementations like MySQL's limit it to up to 3 bytes per character. The actual number is 4.

So what is "up to 5"?

I think with decomposed Hangul, you can end up with 6 or more bytes per character, due to each part of it being two bytes, and 2-4(?) parts per character.
I think unicode, fonts, etc would be an issue even with a plain text editor anyway.

The "styling a range of text" is something i thought but you still need to somehow associate the text with the range - and vice versa - and this doesn't handle things like inserting images and other types of objects since these aren't text.

You could have a document be a series of "paragraphs", each being a series of "elements" with each "element" being something like "text" (with a style), "image", etc. But then once tables enter the picture, you need to expand paragraphs to be of "table" type and each table cell is itself a self-contained "series of paragraphs" - and then start thinking about nested tables or images in tables!

Generalize that enough to avoid special cases inside special cases and you end up with more of a tree-like structure representing a DOM and less with a linear structure with range-based styling.

(of course, then again, i don't remember Write for Windows 3.1 having tables in the first place :-P but i'm interested if there are alternative approaches anyway)

EDIT: one thing i forgot to mention - and why i am curious about non-DOM-based approaches - is that one problem with the DOM approach is the selection: with a linear/range-based structure the selection is just one or two indices inside the range, but with the DOM the selection can start from a node with node-specific subrange (e.g. character in a text node) and end with another node and both being very unrelated to each other (i.e. only having some distant common ancestor and not necessarily at the same level).

I might have a way to simplify this?

A plaintext document is an array of chars, a richtext document is tree, which may or may not be well-formed.

Think about someone trying to bold semi-half of_a sentence_, and how MS Frontpage was made by smart people, it’s just really hard.

The most interesting thing lately is the HTML attribute `contenteditable`, and how it almost just kinda works! You still have to be full-stack to make something good, but that was an amazing improvement to the browser.

Yeah, that trying to bold half of a sentence - or even better, the middle of a sentence - is why i was wondering about simpler alternatives to DOM. Some time ago i toyed around with an HTML editor[0] (that one had to use a DOM anyway, but my question is for rich text editing in general - BTW the rectangles in the shot show a selection that goes across nodes) and doing something like that involved traversing all the nodes (going both down and up the node tree, starting from the cursor's starting position), finding the closest common ancestors under the selection, creating "B" siblings to them and then reparenting them under these new "B" nodes.

You can move a lot of that stuff to reusable methods but personally i find the whole "editing" aspect to be more involved than the "drawing" side - and also the one more likely to be different than a plain text editor - when dealing with DOM-like structures. Hence why i am interested to see what alternatives there are.

[0] https://i.imgur.com/jLlyNSS.png

Nice. I’ve searched far and wide, and most people still end up starting a whole company that only does a text editor.

Quill works but is basically dead since 2017. Almost anything foss is in a similarly ambiguous boat.

And then people like us, defeated, eventually buy something when we actually need it.

There are just so many ways that users try to use it. It’s a tough problem!

Text is usually stored as tree either way in an editor, using a DOM-like approach might work well on top of the usual datastructures.

> with the DOM the selection can start from a node with node-specific subrange (e.g. character in a text node) and end with another node and both being very unrelated to each other

I'd just store the range as character indices, using those the right nodes in the tree can be accessed pretty quickly as needed.

I don't think character indices are enough, what if your selection begins at the middle of a table cell and ends on an image that is the only child of a cell in a completely different table (no text involved at all, except some text in the cells in between)? If you want to, e.g., delete those how do you find which nodes are to be deleted and updated (e.g. for merging the two tables if there are cells after the one that contains the image)?
Images and table cells are just nodes within the tree holding the text, assuming all styling etc is represented in a plain text syntax similar to Markdown, of course. Looking up the nodes from char indices is quick if each node stores how many chars it contains.

Other approaches would probably require the selection to be a tree of its own, I can't really say whether that's simpler overall or not.

The syntax shouldn't matter (you may not even being using a plain text syntax - or any syntax - anyway), you could treat an image or whatever as a single "special" character. Or just assign a linearly increasing ID (increasing in the order the text, images, etc flows) to each node.

Though that is basically another way to represent what i wrote above with having a pair of node pointers and a subrange (well, an index actually, the other end of the subrange is implicit if the node pointers are different). This is basically what the old HTML editing control Microsoft had back in the 90s used and that worked with the DOM tree (also what i used in a test editor i wrote some time ago). And yeah it isn't simple.

Start and end of a selection would be nodes in the tree. The algorithm to find all the selected nodes is already part of the renderer, as the renderer has to implement traversal of nodes in text order as well.
Indeed, you could even use a piece tree just like in the blog but you store additional information in each piece which tell the renderer how to layout the associated text. My understanding is that most rich text editors represent the text as a node-based tree anyway.
Some early rich text editors never used a tree representation for formatting represented the formatting as "control character sequences". This was part of why some people loved WordPerfect so much because you could toggle a view of all the control characters and just delete/copy/paste them like any other text in the same document.

It's basically how the classic RTF format [1] works, and things like VT100/ANSI escape codes in terminals. It's kind of like the difference between imperative code and declarative code: "this character sequence means toggle the state of bold" versus "this node of characters is bold".

[1] https://docs.fileformat.com/word-processing/rtf/