by winety
1187 days ago
It’s crazy, and that’s why hyphenation doesn’t really work that way. Both TeX and web browsers use Liang’s algorithm to split words. [1] It uses so-called patterns, which are short substrings of words in which numbers indicate how to divide the word. For example, the pattern “s1h” indicates that in the word “fishing”, a divider can be inserted between “s” and “h”. Patterns compete and can override each other: odd numbers allow a break, even numbers forbid one, and where several patterns cover the same position the highest number wins. The whole thing is quite complicated. As for your example with Qishan: the “s-h” probably overrides the “i-s” pattern.
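To make the pattern mechanics concrete, here is a minimal Python sketch of the matching step. The function names and the three toy patterns are made up for illustration and are not taken from the real TeX pattern files. The minimum fragment lengths of 2 and 3 letters mirror TeX's default \lefthyphenmin and \righthyphenmin.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 's1h' into its letters and per-gap numbers."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)   # weights[k] is the gap before letters[k]
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the fragments of `word` allowed by `patterns` (toy implementation)."""
    padded = "." + word.lower() + "."    # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)     # scores[k] is the gap before padded[k]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # Competing patterns: the highest number at a gap wins.
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # The gap before word[i] is scores[i + 1] because of the leading dot.
        if scores[i + 1] % 2 == 1:       # odd number: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces

# Toy patterns only, chosen to show one pattern overriding another.
print(hyphenate("fishing", ["s1h"]))                 # ['fis', 'hing']
print(hyphenate("fishing", ["s1h", "is2h", "h1i"]))  # ['fish', 'ing']
```

With only “s1h” the sketch produces the bad split; adding a higher even number at the same gap (“is2h”) suppresses it, which is the kind of competition between patterns described above.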
(There have been a number of articles in TeX journals that explain the algorithm, such as [2].) In CSS, automatic hyphenation must be explicitly turned on with hyphens: auto, see [3]. In TeX and on the web, hyphenation points can be marked explicitly: in TeX with the \- macro and in HTML with the &shy; entity, i.e. the U+00AD soft hyphen character. In TeX you can also override the automatic division with \hyphenation{}. The splitting algorithm in CSS is worse than the one in TeX, because it has to work in real time and because (good) splitting patterns are often missing.

[1]: https://www.tug.org/docs/liang/
[2]: https://www.fi.muni.cz/usr/sojka/papers/euro01.pdf
[3]: https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens
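As a rough illustration of explicit marking, the sketch below takes entries written the way \hyphenation{} expects them (e.g. fish-ing) and inserts U+00AD at those points in running text. The helper names are hypothetical, and whether a given browser or e-reader honours the soft hyphens is an assumption that would need checking.

```python
import re

SOFT_HYPHEN = "\u00ad"

def build_exceptions(entries):
    """Map each plain lowercase word to its break offsets, e.g. 'fish-ing' -> [4]."""
    table = {}
    for entry in entries:
        parts = entry.split("-")
        offsets, total = [], 0
        for part in parts[:-1]:
            total += len(part)
            offsets.append(total)
        table["".join(parts).lower()] = offsets
    return table

def add_soft_hyphens(text, exceptions):
    """Insert U+00AD into every word of `text` that has a listed exception."""
    def mark(match):
        word = match.group(0)
        offsets = exceptions.get(word.lower())
        if not offsets:
            return word                  # unknown word: leave untouched
        pieces, last = [], 0
        for off in offsets:
            pieces.append(word[last:off])
            last = off
        pieces.append(word[last:])
        return SOFT_HYPHEN.join(pieces)
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    return re.sub(r"[^\W\d_]+", mark, text)

exceptions = build_exceptions(["fish-ing", "for-est"])
marked = add_soft_hyphens("Fishing in the forest", exceptions)
```

The result is ordinary text with invisible break opportunities, so it needs no support beyond the soft hyphen itself.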
|
And that is what the algorithm you refer to does! Your links [1] and [2] speak specifically in terms of the patterns being a form of data compression that is applied to lighten the storage requirements of a big list of correct hyphenation points. The hyphenation algorithm is just that you check the word you want to hyphenate against the Master List Of All Words and learn where hyphenation is allowed. The patterns are a form of data preprocessing that makes that algorithm more efficient (here, in terms of space requirements) without changing the output.
So what we need is a way to extend the set of precomputed rules whenever we want to use a word that wasn't in the original dictionary. As noted, TeX provides this with the \hyphenation{} command. Why is this not available in CSS?
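The lookup order this implies can be sketched in a few lines of Python: consult a user-supplied exception list first and fall back to the precomputed rules only for words it does not contain, which is how TeX treats \hyphenation{} entries. The names here are hypothetical, the fallback is a stand-in that never breaks a word, and the break points in the sample entries are my assumption of the intended splits.

```python
def make_hyphenator(exception_entries, fallback):
    """Build a hyphenator that prefers explicit exceptions over `fallback`."""
    exceptions = {
        entry.replace("-", "").lower(): entry.lower().split("-")
        for entry in exception_entries
    }

    def hyphenate(word):
        parts = exceptions.get(word.lower())
        if parts is None:
            return fallback(word)        # not listed: use the pattern-based rules
        # Re-slice the original word so its capitalisation is preserved.
        pieces, pos = [], 0
        for part in parts:
            pieces.append(word[pos:pos + len(part)])
            pos += len(part)
        return pieces

    return hyphenate

def no_breaks(word):
    """Stand-in for a real pattern-based hyphenator: never breaks a word."""
    return [word]

hyphenate = make_hyphenator(["Tsing-hua", "Qi-shan"], no_breaks)
print(hyphenate("Tsinghua"))   # ['Tsing', 'hua']
print(hyphenate("table"))      # ['table']
```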
Suppose I want to write an ebook that doesn't make mistakes on the level of "fis-hing" and "f-orest". [Another example I'm not making up; the Kindle app is convinced that "Ts-inghua" is correct hyphenation.] How do I include the hyphenation information in my document?