| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by svat 1565 days ago

As someone very familiar with the Knuth–Plass line-breaking algorithm (https://tex.stackexchange.com/a/423578/48), an important difference I see here is that for paragraphs (the domain of TeX), there is no "state" that needs to be preserved across lines: if you know that your paragraph is going to choose a certain break-point, then you can pretty much typeset the "before" and "after" parts independently, each optimally. (With one exception: there is a penalty for hyphens being on successive lines, so we need to track whether the previous line was hyphenated.) This is the "optimal substructures" property that makes it so amenable to dynamic programming.

With the code formatter, to format the part after a certain character, you need to keep track of the indentation depth of all the expressions that have not yet terminated at this point — because you presumably want parallel expressions to be formatted with the same indentation depth, for closing parentheses to match their corresponding opening parentheses, etc.

For example, in this example:

    experimental = document.querySelectorAll('link').any((link) =>
        link.attributes['rel'] == 'import' &&
            link.attributes['href'] == POLYMER_EXPERIMENTAL_HTML);

and, say (I'm making this up):

    experimental = 
        document.querySelectorAll('link').any(
            (link) => 
                link.attributes['rel'] == 'import' &&
                    link.attributes['href'] == POLYMER_EXPERIMENTAL_HTML);

— knowing that there's a break after the `&&` is not enough; you also need to know the indentation of the previous expressions, to decide how you're going to format the part after the `&&`.

This is what the author alludes to in the post:

> A line break changes the indentation of the remainder of the statement, which in turn affects which other line breaks are needed. Sorry, Knuth. No dynamic programming this time. […] For most of the time, the formatter did use dynamic programming and memoization. […] It worked fairly well, but was a nightmare to debug.

> It was highly recursive, and ensuring that the keys to the memoization table were precise enough to not cause bugs but not so precise that the cache lookups always fail was a very delicate balancing act. Over time, the amount of data needed to uniquely identify the state of a subproblem grew, including things like the entire expression nesting stack at a point in the line, and the memoization table performed worse and worse.

In TeX, paragraphs have each line of the same width (simple case) or can have a \parshape (in general), but these are "global" constraints that don't depend on what breaks you choose.

1 comments

contravariant 1565 days ago

Not sure if saying you can't use dynamic programming is accurate though, you simply can't use a direction translation of Knuth's algorithm since this misses indentation.

If I recall correctly Knuth uses Dijkstra on a graph with nodes of 'line break at position x' and I don't see why you couldn't use a graph consisting of 'line break at position x indentation y' or something similar.

link

svat 1565 days ago

It's the "or something similar" that's the catch. For another example, search the post for `scriptLoadingTimeout` — to know how to indent the code after a break immediately after that position, you need to know the indentation of the `.timeout` before it, of the immediately preceding `.then(`, and of the `return` at the top — basically you need to know the indentation level of every parent in the expression tree. That means the graph's states are something like "line break at position x, with indentation of parent expression nodes being …, …, …", and then you have too many states as mentioned in the post. There's a combinatorial explosion of the state space. Using dynamic programming with this large state space is still possible, but approaches the running time of the brute-force algorithm. (I do wonder how extensively it was tried, though.)

link