| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vcmiraldo 1543 days ago
	I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)

5 comments

stavros 1543 days ago

Please tell me the final output of your PhD was a differtation.

link

vcmiraldo 1542 days ago

omg!! I really should have left that typo somewhere in there! What a missed opportunity! xD

link

bool3max 1543 days ago

Should've named that repo "phdiff".

link

Groxx 1543 days ago

I'll vote for "diphph"

link

wst_ 1543 days ago

It's tangential but it reminded me of "lighght" poem by Aram Saroyan. https://en.wikipedia.org/wiki/Aram_Saroyan#Minimalism_and_co...

link

pdimitar 1543 days ago

Best pun I've heard in a long time. Well done. <3

link

munk-a 1543 days ago

To be pronounced "Doctor-iff" in speech?

link

MonkeyClub 1543 days ago

"Doctor if and only if <this works>"

link

einpoklum 1543 days ago

So I looked at the paper and it seems interesting. Basic idea: Instead of the operations to consider being "insert", "delete" and "copy", one adds "reorder" "contract subtree" and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though extra ops increase the search space, they actually let you search more effectively. I can buy that argument.

The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!

link

arianvanp 1543 days ago

Some of the GHC performance bugs that we ran into during the research have been fixed as far as I know! Though I'd have to double-check

link

vcmiraldo 1542 days ago

Indeed, we also designed a brand new generics library to work around that. Performance was really not an issue! :)

link

vcmiraldo 1542 days ago

Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it on the first or second position; the second one would be classified an "insertion" by any ins/del/cpy-algorithm. If you instead opt to NOT make that choice, you can say: pick the source `a` and duplicate it instead

link

narush 1543 days ago

Can you give a little color on where the difficulties lie? Is it an efficiency question, or is determining "which changes" hard in the first place?

link

vanderZwan 1543 days ago

Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:

https://victorcmiraldo.github.io/data/MiraldoPhD.pdf#page=24

link

scythmic_waves 1543 days ago

Not OP, but the docs call out some "Tricky Cases" [1].

[1] https://difftastic.wilfred.me.uk/tricky_cases.html

link

teeray 1543 days ago

I’d imagine there’s some challenging judgement calls that such a tool would have to make. Like, in Go, you can reorder the members of a struct definition. In many cases this is just diff noise to reviewers. HOWEVER, it does impact the layout of the struct in memory, so it can be semantically meaningful in performance work.

link

gmfawcett 1543 days ago

A wild nitpicker appears. I understand where you're coming from & why this matters. But Go, the language spec, doesn't make any guarantees about struct layout at all. A layout difference may be meaningful, practically, but it's potentially unreliable.

e.g. see https://groups.google.com/g/golang-nuts/c/1BlZDNBLiAM

Having said that: if a Go compiler for a given architecture decided to change its layout algorithm, I'm pretty sure it would earn a changelog entry.

link

munk-a 1543 days ago

PHP long stated that associative array sorting order was unstable and not guaranteed (especially when the union (+) operator or array_merge function were involved) - that doesn't mean ten bazillion websites wouldn't instantly break if they ever actually changed the ordering to be unpredictable.

Language designers need to contend with the fact that the ultimate final say in whether a thing is or not is whether that behavior is observed.

link

giraffe_lady 1543 days ago

Didn't ruby actually do exactly this though? And it broke a million websites and they changed it back in the next version and have made it explicit ever since? To me that is much stronger evidence than what we think would happen if php did it.

link

zukzuk 1543 days ago

I wrote a masters thesis about the more general problem here (https://tspace.library.utoronto.ca/bitstream/1807/65616/11/Z...).

The tl;dr is that there's an almost infinite number of ways to atomize/conceptualize code into meaningful "units" (to "register" it, in my supervisor's words), and the most appropriate way to do that is largely perspectival — it depends on what you care about after the fact, and there is no single maximal way to do it up front.

link

ummonk 1543 days ago

I mean to have an improvement over the status quo we need to simply find a conception that works better than lines as units of code. Let’s not let perfect be the enemy of the good.

link

giraffe_lady 1543 days ago

love to tell someone who literally wrote a masters thesis on a topic what we need to "simply do" to solve it lol. I almost want to admire the confidence but

link

noduerme 1543 days ago

>>I’d imagine there’s some challenging judgement calls that such a tool would have to make

Just thinking about it makes my head spin. I spend a lot of time working out font/color hierarchies, supplementary to coding and data viz. Arguably what you're bringing up is a case for a carefully colored diff that visually cues whether something is a true semantic change or indicative of a lower level issue. I'm comfortable with reading a plain ol' diff that just shows me what changed, superficially, and interpreting it. While I think OP's idea is awesome, it also might create more confusion than it resolves; and resolving confusion is the point of a diff.

link

vcmiraldo 1542 days ago

Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).

The real difficult part is in how you represent AST-level changes, which will limit what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:

([1,3], [4,2,5]) <-- q -- ([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])

Both p and q move the same thing to different places so they need a human to make a decision about what's the correct merge. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's because we didn't add insertions nor deletions. Because now, say p was:

([1,2,3], [4,5]) -- p --> ([1,3], [2,5])

One could argue that we can now merge, because '4' was also deleted hence the position in which we insert '2' in the second list is irrelevant.

If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)

link

infogulch 1542 days ago

How does your work relate to tree-sitter, which also manages patches which it describes as "incremental parsing" as well as error states.

link