Hacker News new | ask | show | jobs
by vcmiraldo 1543 days ago
I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)
5 comments

Please tell me the final output of your PhD was a differtation.
omg!! I really should have left that typo somewhere in there! What a missed opportunity! xD
Should've named that repo "phdiff".
I'll vote for "diphph"
It's tangential but it reminded me of "lighght" poem by Aram Saroyan. https://en.wikipedia.org/wiki/Aram_Saroyan#Minimalism_and_co...
Best pun I've heard in a long time. Well done. <3
To be pronounced "Doctor-iff" in speech?
"Doctor if and only if <this works>"
So I looked at the paper and it seems interesting. Basic idea: Instead of the operations to consider being "insert", "delete" and "copy", one adds "reorder" "contract subtree" and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though extra ops increase the search space, they actually let you search more effectively. I can buy that argument.

The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!

Some of the GHC performance bugs that we ran into during the research have been fixed as far as I know! Though I'd have to double-check
Indeed, we also designed a brand new generics library to work around that. Performance was really not an issue! :)
Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it on the first or second position; the second one would be classified an "insertion" by any ins/del/cpy-algorithm. If you instead opt to NOT make that choice, you can say: pick the source `a` and duplicate it instead
Can you give a little color on where the difficulties lie? Is it an efficiency question, or is determining "which changes" hard in the first place?
Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:

https://victorcmiraldo.github.io/data/MiraldoPhD.pdf#page=24

Not OP, but the docs call out some "Tricky Cases" [1].

[1] https://difftastic.wilfred.me.uk/tricky_cases.html

I’d imagine there’s some challenging judgement calls that such a tool would have to make. Like, in Go, you can reorder the members of a struct definition. In many cases this is just diff noise to reviewers. HOWEVER, it does impact the layout of the struct in memory, so it can be semantically meaningful in performance work.
A wild nitpicker appears. I understand where you're coming from & why this matters. But Go, the language spec, doesn't make any guarantees about struct layout at all. A layout difference may be meaningful, practically, but it's potentially unreliable.

e.g. see https://groups.google.com/g/golang-nuts/c/1BlZDNBLiAM

Having said that: if a Go compiler for a given architecture decided to change its layout algorithm, I'm pretty sure it would earn a changelog entry.

PHP long stated that associative array sorting order was unstable and not guaranteed (especially when the union (+) operator or array_merge function were involved) - that doesn't mean ten bazillion websites wouldn't instantly break if they ever actually changed the ordering to be unpredictable.

Language designers need to contend with the fact that the ultimate final say in whether a thing is or not is whether that behavior is observed.

Didn't ruby actually do exactly this though? And it broke a million websites and they changed it back in the next version and have made it explicit ever since? To me that is much stronger evidence than what we think would happen if php did it.
I wrote a masters thesis about the more general problem here (https://tspace.library.utoronto.ca/bitstream/1807/65616/11/Z...).

The tl;dr is that there's an almost infinite number of ways to atomize/conceptualize code into meaningful "units" (to "register" it, in my supervisor's words), and the most appropriate way to do that is largely perspectival — it depends on what you care about after the fact, and there is no single maximal way to do it up front.

I mean to have an improvement over the status quo we need to simply find a conception that works better than lines as units of code. Let’s not let perfect be the enemy of the good.
love to tell someone who literally wrote a masters thesis on a topic what we need to "simply do" to solve it lol. I almost want to admire the confidence but
>>I’d imagine there’s some challenging judgement calls that such a tool would have to make

Just thinking about it makes my head spin. I spend a lot of time working out font/color hierarchies, supplementary to coding and data viz. Arguably what you're bringing up is a case for a carefully colored diff that visually cues whether something is a true semantic change or indicative of a lower level issue. I'm comfortable with reading a plain ol' diff that just shows me what changed, superficially, and interpreting it. While I think OP's idea is awesome, it also might create more confusion than it resolves; and resolving confusion is the point of a diff.

Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).

The real difficult part is in how you represent AST-level changes, which will limit what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:

([1,3], [4,2,5]) <-- q -- ([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])

Both p and q move the same thing to different places so they need a human to make a decision about what's the correct merge. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's because we didn't add insertions nor deletions. Because now, say p was:

([1,2,3], [4,5]) -- p --> ([1,3], [2,5])

One could argue that we can now merge, because '4' was also deleted hence the position in which we insert '2' in the second list is irrelevant.

If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)

How does your work relate to tree-sitter, which also manages patches which it describes as "incremental parsing" as well as error states.