by kmike84 2253 days ago
We needed to solve a similar problem - version control & synchronize .json files from different machines (annotations for ML models).

Writing a custom git merge driver was quite painless - a cmdline script (written in Python), which has task-specific logic on how to merge data from these .json files. Load these files, parse them, decide how to combine, detect unresolvable conflicts, etc.

It seems that merging structured data needs custom logic; there is no single best solution, which could make a generic tool harder to build.

git is not a bad base technology for this. I'm not sure what else we are missing (e.g. better diffs for structured data?), because .json is still text; it is only merges that are unreliable if you treat .json as text. There are also caveats, e.g. you can't install a custom merge driver on GitHub, so the "merge" button becomes dangerous. But overall, for .json this approach works fine.


Have you looked into DVC[1] for versioning the data and pipelines that generate them? I have set up a few versioned dataset repositories with it now and quite like it, especially the ability to simply `dvc import` the versioned data into projects and checkout different versions for testing with various models.

It operates on data at the same level as git but with features needed for large datasets and is totally language and framework agnostic like git.

[1]: https://dvc.org/

We looked into it, but it seems to be solving a different problem - how to handle large data. Does it solve merging of structured data?

E.g. a json file is changed on 2 machines, and you need to merge the changes. Sometimes you can merge (e.g. 2 different entries in an array where people are adding annotations), sometimes you need to raise an error, e.g. changes in a single record but for different fields: depending on the problem, you may disallow that to keep the record consistent.
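
A rough sketch of that rule in Python, assuming each file holds a JSON array of records with a unique `id` field. The record layout and the names `merge_records` and `MergeConflict` are made up for illustration; this is not the actual driver.

```python
class MergeConflict(Exception):
    """Both sides changed the same record in different ways."""


def merge_records(base, ours, theirs, key="id"):
    """Three-way merge of lists of dicts, matched by a unique key."""
    base_by_id = {r[key]: r for r in base}
    ours_by_id = {r[key]: r for r in ours}
    theirs_by_id = {r[key]: r for r in theirs}

    # Keep our ordering, then append records that exist only on their side.
    ids = list(ours_by_id) + [i for i in theirs_by_id if i not in ours_by_id]

    merged = []
    for rid in ids:
        b = base_by_id.get(rid)
        o = ours_by_id.get(rid)
        t = theirs_by_id.get(rid)
        if o == t:
            result = o          # identical on both sides
        elif o == b:
            result = t          # only their side changed (or deleted) it
        elif t == b:
            result = o          # only our side changed (or deleted) it
        else:
            # Both sides touched the record; refuse, even if the edits
            # are to different fields, to keep the record consistent.
            raise MergeConflict(f"record {rid!r} changed on both sides")
        if result is not None:  # None means the record was deleted
            merged.append(result)
    return merged
```

Two people appending different annotations merge cleanly; two people editing the same record raise `MergeConflict`, which a driver would turn into a non-zero exit code.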

I wanted to write a git diff for files like KiCad or even Word. I didn’t know custom git merges were a thing. Do you have a link for how to get started?

Both custom diff and merge drivers are described at a high level in gitattributes(5)¹. They're pretty useful even in really basic ways such as adding a textconv with "jq -S ." or "xmllint --pretty 2" to pretty print JSON or XML before calculating diffs.
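
If jq isn't available, the same idea can be sketched in Python. A textconv program receives a single argument, the name of the file to convert, and writes the converted text to stdout. The function name `normalize` is just an illustrative choice, and the output formatting will not match jq byte for byte.

```python
import json
import sys


def normalize(path):
    """Return the JSON file's content with sorted keys and stable indentation."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Only act as a textconv filter when git passes exactly one file name.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(normalize(sys.argv[1]))
```

Because keys are sorted before diffing, two files that differ only in key order produce an empty diff.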

Plus, if you've already dipped into those docs to see the diff options, be sure to check the funcname option too. It allows you to add custom diff(1)-style `--show-function-line` options. For example, you can use an ugly regex such as `^\\[\\(.*\\)\\]$` to guess section names in .ini file diffs. Or the wordRegex option to make CSV files break on fields with `git diff --word-diff`. Or... well, thousands of other things. There are tonnes of things you can do to improve diffs and merges for textual data in addition to the things you may want to do with binary blobs.

1. https://git-scm.com/docs/gitattributes

The API is quite simple - you need to implement a script which takes 3 arguments, writes the result of the merge to a file, and exits with a non-zero status code in case of a merge error. Quote from https://git-scm.com/docs/gitattributes#_defining_a_custom_me...:

To define a custom merge driver filfre, add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this

  [merge "filfre"]
    name = feel-free merge driver
    driver = filfre %O %A %B %L %P
    recursive = binary

The merge.*.name variable gives the driver a human-readable name.

The merge.*.driver variable’s value is used to construct a command to run to merge ancestor’s version (%O), current version (%A) and the other branches' version (%B). These three tokens are replaced with the names of temporary files that hold the contents of these versions when the command line is built. Additionally, %L will be replaced with the conflict marker size (see below).

The merge driver is expected to leave the result of the merge in the file named with %A by overwriting it, and exit with zero status if it managed to merge them cleanly, or non-zero if there were conflicts.
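
A minimal skeleton of such a driver in Python, assuming it is registered with `%O %A %B` only and that the files are JSON. The `merge` function here is a deliberately naive placeholder (whole-file comparison), not the task-specific logic described above; the names `merge` and `main` are illustrative.

```python
import json
import sys


def merge(base, ours, theirs):
    """Placeholder logic: accept whichever side changed, fail if both did."""
    if ours == theirs or theirs == base:
        return ours
    if ours == base:
        return theirs
    raise ValueError("both sides changed the file")


def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def main(base_path, ours_path, theirs_path):
    """Merge into ours_path (%A); return the exit status git expects."""
    try:
        merged = merge(load(base_path), load(ours_path), load(theirs_path))
    except ValueError as exc:
        print(f"merge conflict: {exc}", file=sys.stderr)
        return 1  # non-zero tells git the merge had conflicts
    # git expects the result to overwrite the file named by %A.
    with open(ours_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, sort_keys=True)
        f.write("\n")
    return 0


# Only run as a driver when git supplies the three temporary file paths.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(*sys.argv[1:]))
```

With the script saved as, say, `json-merge.py` (a hypothetical name), the `driver` line would become something like `driver = python3 json-merge.py %O %A %B`, and a gitattributes entry such as `*.json merge=<driver name>` routes .json files to it.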

For Word documents I had some luck storing the unzipped contents of the file (since a DOCX is mostly XML files in a ZIP container). My approach was automating the zip/unzip process (and some cleanup steps) pre-commit and post-checkout. https://github.com/WorldMaker/musdex
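
To give the flavour of the unzip/zip round trip, here is a rough Python sketch using the standard `zipfile` module. This is not musdex's actual code; the function names `explode` and `rebuild` are invented here, and the cleanup steps mentioned above are left out.

```python
import zipfile
from pathlib import Path


def explode(container, out_dir):
    """Extract every member of a ZIP-based document; return the member order."""
    with zipfile.ZipFile(container) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


def rebuild(src_dir, container, names):
    """Re-create the container from extracted files, keeping member order."""
    with zipfile.ZipFile(container, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(Path(src_dir) / name, arcname=name)
```

A pre-commit hook would call something like `explode` and stage the extracted XML; a post-checkout hook would call `rebuild` to get an openable file back.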

Specifically for Word files, though, your best bet might be to launch Word's own compare-files GUI tools as a merge engine. I had several reasons at the time to explore a "container destructuring tool" for source control instead.