Hacker News
by winety 1187 days ago
It’s crazy, and that’s why hyphenation doesn’t really work that way. Both TeX and web browsers use Liang’s algorithm to split words. [1] It uses so-called patterns, which are short substrings of words in which numbers indicate how to divide the word. For example, the pattern “s1h” indicates that in the word “fishing”, a divider can be inserted between “s” and “h”. Patterns compete and can override each other: where several patterns cover the same position, the highest number wins, and an odd number allows a break while an even number forbids one. The whole thing is quite complicated. As for your example with Qishan, the “s-h” probably overrides the “i-s” pattern. (There have been a number of articles in TeX journals that explain the algorithm, such as [2].)
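
To make the pattern mechanism concrete, here is a minimal Python sketch of Liang-style matching. The pattern list is a made-up toy set, not taken from any real TeX pattern file, and the names parse_pattern and hyphenate are hypothetical. The left_min and right_min defaults mirror TeX's usual \lefthyphenmin=2 and \righthyphenmin=3.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 's1h' into its letters 'sh' and gap levels [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # digit applies to the gap before the next letter
        else:
            pos += 1
    return letters, levels


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at the break points the patterns allow."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] rates the gap before padded[i]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                # competing patterns: the highest number at each gap wins
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # odd allows a break, even forbids it
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


# Toy patterns, assumed for illustration only.
print(hyphenate("fishing", ["s1h"]))
print(hyphenate("fishing", ["s1h", "is2h", "h1i"]))
```

With only “s1h” the toy set splits the word as fis-hing. Adding the longer pattern “is2h” suppresses that break, and “h1i” supplies fish-ing instead.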

In CSS, automatic hyphenation must be turned on explicitly with hyphens: auto; see [3].

In TeX and on the web, hyphenation points can be marked explicitly: in TeX with the \- macro and in HTML with the &shy; entity or the U+00AD (soft hyphen) character. In TeX you can also override the automatic division with \hyphenation{}.

The splitting algorithm in CSS is worse than the one in TeX, because it has to work in real time and because (good) splitting patterns are often missing.

[1]: https://www.tug.org/docs/liang/

[2]: https://www.fi.muni.cz/usr/sojka/papers/euro01.pdf

[3]: https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens

2 comments

It seems very clear that Amazon's default approach is to insert hyphens based on a whitelist of correct hyphenation points.

And that is what the algorithm you refer to does! Your links [1] and [2] speak specifically in terms of the patterns being a form of data compression that is applied to lighten the storage requirements of a big list of correct hyphenation points. The hyphenation algorithm is just that you check the word you want to hyphenate against the Master List Of All Words and learn where hyphenation is allowed. The patterns are a form of data preprocessing that makes that algorithm more efficient (here, in terms of space requirements) without changing the output.

So what we need is a way to extend the set of precomputed rules whenever we want to use a word that wasn't in the original dictionary. As noted, TeX provides this with the \hyphenation{} command. Why is this not available in CSS?
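
As a rough illustration of what a \hyphenation{}-style extension does, the sketch below layers an author-supplied exception list over whatever automatic hyphenator is in use. Everything here is hypothetical: the function names, the space-separated input format (borrowed from TeX's syntax), and the fallback callable are assumptions, not an existing CSS or Kindle feature.

```python
def parse_exceptions(spec):
    """Turn a TeX-like list such as 'Tsing-hua fish-ing' into {word: [break offsets]}."""
    table = {}
    for entry in spec.split():
        parts = entry.split("-")
        positions, offset = [], 0
        for part in parts[:-1]:
            offset += len(part)
            positions.append(offset)  # a break is allowed after this many letters
        table["".join(parts).lower()] = positions
    return table


def break_points(word, exceptions, fallback):
    """Prefer the author's exception; otherwise ask the automatic hyphenator."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return fallback(word)


# Example entries chosen by the document author (assumed for illustration).
exceptions = parse_exceptions("Tsing-hua fish-ing for-est")


def no_breaks(word):
    """Stand-in for a real pattern-based hyphenator."""
    return []


print(break_points("Tsinghua", exceptions, no_breaks))
print(break_points("forestry", exceptions, no_breaks))
```

The exception table wins whenever it has an entry, so a word missing from the original dictionary can be corrected without touching the precomputed patterns.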

Suppose I want to write an ebook that doesn't make mistakes on the level of "fis-hing" and "f-orest". [Another example I'm not making up; the Kindle app is convinced that "Ts-inghua" is correct hyphenation.] How do I include the hyphenation information in my document?

> The splitting algorithm in CSS is worse than the one in TeX, because it has to work in real time and because (good) splitting patterns are often missing.

Surely that's only the case for real-time renderers like web browsers.

If you're creating a layout engine for printed media that uses CSS as the way for authors/setters to specify style, couldn't it implement a better, slower splitting algorithm? Using an internal (or pluggable?) dictionary of hyphenations?

You could, and that's basically what TeX does, just without the CSS. There are even typesetting systems similar to (La)TeX that can take XML as input; see ConTeXt [1] or SILE [2]. They’re just a step away from using HTML + CSS. Why isn’t there such a system? I do not know.

[1]: https://wiki.contextgarden.net/XML

[2]: https://sile-typesetter.org/

You could probably write some JS to reimplement the better algorithm and insert the &shy; hyphenation hints.
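
A sketch of that idea, written in Python rather than JS and run as a preprocessing step over plain text instead of in the browser. The add_soft_hyphens helper and the lookup table are hypothetical; break_points stands for any function that returns break offsets for a word, such as a Liang implementation or an exception list. It does not understand markup, so it should only be fed text nodes, not raw HTML.

```python
import re

SOFT_HYPHEN = "\u00ad"  # the character that &shy; stands for


def add_soft_hyphens(text, break_points):
    """Insert U+00AD at every offset that break_points(word) reports."""

    def mark(match):
        word = match.group(0)
        pieces, last = [], 0
        for pos in break_points(word):
            pieces.append(word[last:pos])
            last = pos
        pieces.append(word[last:])
        return SOFT_HYPHEN.join(pieces)

    # [^\W\d_]+ matches runs of letters, including non-ASCII ones
    return re.sub(r"[^\W\d_]+", mark, text)


# Hypothetical lookup table standing in for a real hyphenator.
lookup = {"fishing": [4], "forest": [3]}


def toy_breaks(word):
    return lookup.get(word.lower(), [])


hinted = add_soft_hyphens("Gone fishing in the forest.", toy_breaks)
print(hinted.replace(SOFT_HYPHEN, "|"))  # show the invisible hints as bars
```

The renderer then only breaks at the marked positions when a line actually needs it, so the hints are invisible otherwise.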