Hacker News
by mdciotti 1283 days ago
I've frequently wondered why a hierarchical approach is the norm for text formatting. It seems that many problems could be solved trivially using a text buffer and a list of formatting sequences defined by a starting index and a length. The only place I've seen this in practice is in Telegram's TL Schema [1]. Is this method found anywhere else?
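For concreteness, here is a minimal sketch of the "text buffer plus list of formatting sequences" idea in Python. The `Span` type, its field names, and `styles_at` are made up for illustration and are not taken from Telegram's schema or any other library. Offsets here are plain Python string indices; a real format also has to pin down what unit offsets are counted in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical out-of-band formatting record."""
    kind: str    # e.g. "bold" or "italic"
    offset: int  # index of the first formatted character
    length: int  # number of formatted characters


# The text carries no markup at all; formatting lives next to it.
text = "italic bold and italic bold"
spans = [
    Span("italic", 0, 22),  # covers "italic bold and italic"
    Span("bold", 7, 20),    # covers "bold and italic bold"
]


def styles_at(index, spans):
    """Return the set of formatting kinds that cover one character index."""
    return {s.kind for s in spans if s.offset <= index < s.offset + s.length}
```

Because the spans are independent records, the italic and bold ranges can overlap without either one having to be the parent of the other.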

Edit to note: there is one obvious advantage to in-band markup such as HTML -- streaming formatted content. Though I wonder if this could be done with a non-hierarchical method, for example using in-band start tags which also encode the length.
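A rough sketch of what that streaming variant might look like, in Python. The event shapes (`("start", kind, length)` and `("text", chunk)`) are invented here purely for illustration; this is not an existing format. The consumer only needs to remember how many characters each open style still has left.

```python
def decode(stream):
    """Yield (character, active kinds) pairs from a hypothetical event stream.

    Events are either ("start", kind, length), meaning the next `length`
    characters carry `kind`, or ("text", chunk) with literal characters.
    """
    active = []  # list of [kind, characters_remaining]
    for event in stream:
        if event[0] == "start":
            _, kind, length = event
            if length > 0:
                active.append([kind, length])
        else:
            _, chunk = event
            for ch in chunk:
                yield ch, frozenset(kind for kind, _ in active)
                # One character consumed: count down every open style
                # and drop the ones that have run out.
                for entry in active:
                    entry[1] -= 1
                active = [entry for entry in active if entry[1] > 0]


stream = [
    ("start", "italic", 22),
    ("text", "italic "),
    ("start", "bold", 20),
    ("text", "bold and italic bold"),
]
decoded = list(decode(stream))
```

There are no end tags, so overlapping ranges need no special handling. The cost is that the producer must know each range's length before it emits the start event.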

Edit 2: looks like Condé Nast maintains a similar technology called atjson [2].

[1]: https://core.telegram.org/api/entities

[2]: https://github.com/CondeNast/atjson

4 comments

There are a number of rich text editors that model documents as a flat array of characters and a separate table of formatting modifiers (each with an offset and length). Medium's text editor is one of them. This post [1] on their engineering blog introduced me to the idea, and I think it's a good starting point for anyone interested in this topic.

ProseMirror (a JavaScript library for building rich text editors) also employs a document model like this. The docs for that project [2] do a good job of explaining how their implementation of this idea works, and what problems it solves.

[1]: https://medium.engineering/why-contenteditable-is-terrible-1...

[2]: https://prosemirror.net/docs/guide/#doc

"I've frequently wondered why a hierarchical approach is the norm for text formatting."

80/20, if not 90/10, effectiveness. Most people are not trying to do what the Wikipedia article is talking about. About the most complicated thing that people want to do is the moral equivalent of <i>italic <b>bold and italic</i> bold</b>, and you can losslessly convert that to <i>italic <b>bold and italic</b></i><b> bold</b> for almost all practical purposes.
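The conversion described above can be sketched in Python as follows. This is one simple way to do it, not a reference algorithm: spans are hypothetical `(tag, start, end)` tuples with an exclusive end, assumed to lie within the text, and whenever a tag has to end, everything opened after it is closed and then reopened.

```python
import html


def to_nested_html(text, spans):
    """Render possibly overlapping (tag, start, end) spans as nested tags."""
    # Cut the text at every span boundary so the set of active tags
    # is constant inside each segment.
    cuts = sorted({0, len(text)} | {p for _, s, e in spans for p in (s, e)})
    out = []
    open_tags = []  # stack of currently open tags, outermost first
    for start, end in zip(cuts, cuts[1:]):
        active = [tag for tag, s, e in spans if s <= start and end <= e]
        # Keep the longest prefix of the stack that is still active.
        keep = 0
        while keep < len(open_tags) and open_tags[keep] in active:
            keep += 1
        # Close everything above that prefix, innermost first.
        for tag in reversed(open_tags[keep:]):
            out.append(f"</{tag}>")
        del open_tags[keep:]
        # Open (or reopen) whatever is active but not currently open.
        for tag in active:
            if tag not in open_tags:
                out.append(f"<{tag}>")
                open_tags.append(tag)
        out.append(html.escape(text[start:end]))
    for tag in reversed(open_tags):
        out.append(f"</{tag}>")
    return "".join(out)


text = "italic bold and italic bold"
spans = [("i", 0, 22), ("b", 7, 27)]
rendered = to_nested_html(text, spans)
# rendered == "<i>italic <b>bold and italic</b></i><b> bold</b>"
```

The bold range ends up split across two `<b>` elements, which is exactly the "lossless for almost all practical purposes" trade: the rendering is the same, but the fact that it was one range is no longer visible in the markup.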

It isn't until you're getting very precise about what your tags mean, for tags that intrinsically "cross" hierarchies like that, that you start seeing these issues. And then by the time you've gotten that far, you realize you have all sorts of problems, as the article says.

But a good deal of the answer is that while the stuff mentioned in the Wikipedia article is true and important, it's also fairly specialist.

As for "The only place I've seen this in practice is in Telegram's TL Schema [1]. Is this method found anywhere else?", tag-based formatting is the norm for rich text widgets, which generally can natively represent my first HTML example above in their internal format. Generally if you dig into your favorite language you'll find someone has already implemented this efficiently as a library you can pick up if you want to use the capability directly outside of a text widget. It has its own consequences, as anyone who has ever fought with them may realize, but it's not impossibly difficult to deal with.

It isn't a magic solution to everything either, though. Even if it is what you think you want, a widget able to represent a bold section starting in the middle of a paragraph, then proceeding through the first three rows of a table, then stopping in the middle of a paragraph in the third column of the next row is generally weird. To some extent, people have a certain hierarchiness to their thinking about these matters too, whether it's cause or effect. But that hierarchiness is messy; I think it's fair to say most people wouldn't "mean" anything by that bold in my table case, since we don't necessarily expect tags to proceed through tables like that, but <i>i<b>bi</i>b</b> is something that people might intuitively expect to be able to do. It's a fractally messy space in both the computer science and the human expectations, and the fractal messiness only gets messier when we try to harmonize those two things.

I guess because it would be a total pain for humans to read and write without specialised tooling. Imagine trying to add a word at the start of your document. That list of formatting sequences would have to be updated with new indexes whenever the content of the buffer changed. Keeping the two in sync wouldn't be trivial (for a computer or a human). A tree of nodes fixes that and works for 99.99% of use cases.

It may not be trivial, but it's a solved problem. Many rich text UI widgets and corresponding backing data structures exist today, based on a tagging system where tags can trivially define regions that overlap with each other. It's tricky and full of corner cases, but not that hard if you put your mind to it, and it's not computationally inefficient either.
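To make the index bookkeeping discussed above concrete, here is a small Python sketch of inserting text while keeping `(kind, offset, length)` spans in sync. The function name and the edge policy are assumptions for illustration: text inserted exactly at a span's start or end is treated as outside the span, which is one of the corner cases a real editor has to decide on.

```python
def insert_text(text, spans, index, new):
    """Insert `new` at `index` and return (new_text, adjusted_spans).

    Spans are hypothetical (kind, offset, length) tuples.
    """
    adjusted = []
    for kind, offset, length in spans:
        end = offset + length
        if index <= offset:
            # Insertion at or before the span start: slide the span right.
            offset += len(new)
        elif index < end:
            # Insertion strictly inside the span: the span grows.
            length += len(new)
        # Insertion at or after the span end leaves the span unchanged.
        adjusted.append((kind, offset, length))
    return text[:index] + new + text[index:], adjusted


text = "italic bold and italic bold"
spans = [("italic", 0, 22), ("bold", 7, 20)]
new_text, new_spans = insert_text(text, spans, 0, "Some ")
# new_spans == [("italic", 5, 22), ("bold", 12, 20)]
```

This version walks every span on each edit, which is fine for a short message. The efficient data structures mentioned above avoid that linear scan, but the per-span rule stays the same.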